Re: UTF-8 to EBCDIC

2002-07-31 Thread Doug Ewell

Two steps are necessary here:

1.  Decode UTF-8 to Unicode scalar values.
2.  Look up the Unicode scalar values in the table referenced by Magda,
and find the corresponding CP037 code point.
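
A minimal Python sketch of these two steps, assuming that Python's built-in
`cp037` codec agrees with the CP037 table referenced above (the function
name is only illustrative):

```python
def utf8_to_cp037(data: bytes, errors: str = "strict") -> bytes:
    """Convert UTF-8 encoded bytes to CP037 (EBCDIC) bytes."""
    # Step 1: decode UTF-8 to Unicode scalar values.
    text = data.decode("utf-8")
    # Step 2: look each scalar value up in the CP037 table.
    # Characters with no CP037 equivalent raise UnicodeEncodeError
    # unless another `errors` policy (for example "replace") is chosen.
    return text.encode("cp037", errors=errors)


if __name__ == "__main__":
    # Print the EBCDIC bytes for a short ASCII-only sample as hex.
    print(utf8_to_cp037("Hello".encode("utf-8")).hex())
```

CP037 has only 256 code points, so most of Unicode has no equivalent and
some error policy has to be chosen for those characters.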

We can help with either of these steps.  Contact us, either on the list
or privately, if you need assistance.

-Doug Ewell
 Fullerton, California


- Original Message -
> > From: Vishweshwaraiah, Balasubramanya
> > [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, July 30, 2002 2:52 PM
> > To: Magda Danish (Unicode)
> > Subject: RE: Web Form: General question
> >
> >
> > Magda Danish,
> > Thanks a lot for your interest in helping me by giving suggestions. I
> > visited the site you mentioned in your reply. I didn't get
> > any idea about
> > how to do the conversion from UTF-8 to EBCDIC or I may be
> > thinking in wrong
> > direction.
> >
> > All I understood from that site is the equivalent ebcdic
> > code(cp037) for
> > each Unicode character.
> >
> > If I have a UTF-8 formatted file, how does this table help
> > me to do the
> > conversion?
> >
> > Kindly advise me if I misunderstood anything.
> >
> > Thanks
> > Balu.
> >
> >
> >
> > -Original Message-
> > From: Magda Danish (Unicode) [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, July 25, 2002 3:27 PM
> > To: [EMAIL PROTECTED]
> > Subject: FW: Web Form: General question
> >
> >
> > Have you tried looking at
> > http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/
> >
> > Best Regards,
> >
> > Magda Danish
> > Administrative Director
> > The Unicode Consortium
> > 650-693-3921
> >
> >
> >
> > > -Original Message-
> > > Date/Time:Thu Jul 25 12:51:15 EDT 2002
> > > Contact:  [EMAIL PROTECTED]
> > > Report Type:  General question
> > >
> > > Hi,
> > >
> > > I need to convert UTF-8 format files to EBCDIC. Could you
> > > please suggest any available tool that does this
> > > conversion, or explain how to do it?
> > >
> > > Thanking you in anticipation of a quick response.
> > >
> > > With Best regards
> > > Balu.
> > >
> > > -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> > > (End of Report)
> > >
> > >
> >
>





Re: Tamil Text Messaging in Mobile Phones

2002-07-31 Thread James Kass


Marco Cimarosti wrote,

> > 
> > ** The stroke in Phaistos symbols in ConScript PUA encoding is
> >  the closest I could find.
> 
> :-)))
> 
> C'mon, be serious! That can be mapped to U+0316 (COMBINING GRAVE ACCENT
> BELOW).
> 

Seriously, you're right!

Best regards,

James Kass.





Re: library for identifying equivalent sequences

2002-07-31 Thread Mark Davis

We do have that in ICU 2.2. It is not a public interface (meaning that we
will likely change the API before we make it public), but it is accessible
if you want to test with it for now.

It is part of what we use to optimize our internal processing by producing
the canonical closure of a dataset. See
http://oss.software.ibm.com/icu/docs/papers/normalization_iuc21.ppt for more
information.

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄

- Original Message -
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Wednesday, July 31, 2002 15:29
Subject: library for identifying equivalent sequences


> I'm wondering if anyone is aware of any software libraries available that
> can be used to solve a particular problem: for a given character sequence,
> I need to enumerate all of the canonically equivalent character sequences.
> Put another, equivalent way: given a character sequence in NFD, I need to be
> able to enumerate all of the sequences that have the same NFD
> representation.
>
> (The underlying issue is that I'm trying to figure out, given some
> precomposed glyph in a font, what are all the valid substitutions that
> could be applied in the smart-font code.)
>
>
>
> - Peter
>
>
> ---
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <[EMAIL PROTECTED]>
>
>
>
>
>





Re: "Missing character" glyph

2002-07-31 Thread Doug Ewell

Asmus Freytag  wrote:

> No code point is safe.

Indeed, but some are less unsafe than others.  You can't use U+FFFF,
because some process might actually filter out noncharacters.  You can't
use U+FFFD, because some process might generate a special glyph for it
(SC UniPad does).  And the moment you settle on some hollow rectangle or
square as your example, someone will point out that it doesn't match the
black square or whatever that appears on *his* system.
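
A small Python sketch of the distinction being drawn here, between
noncharacters and code points that are merely unassigned.  The helper names
are illustrative, and `is_unassigned` reflects whatever Unicode version the
local Python build ships, not Unicode 3.2:

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for the code points Unicode permanently reserves as noncharacters."""
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_unassigned(cp: int) -> bool:
    """True if the local Unicode database has no character at this code point."""
    return unicodedata.category(chr(cp)) == "Cn"


if __name__ == "__main__":
    for cp in (0xFFFF, 0xFFFD, 0x03A2):
        print(f"U+{cp:04X}",
              "noncharacter" if is_noncharacter(cp) else "not a noncharacter",
              "unassigned" if is_unassigned(cp) else "assigned")
```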

-Doug Ewell
 Fullerton, California





Re: Teletext

2002-07-31 Thread Shlomi Tal

>From: Lars Marius Garshol <[EMAIL PROTECTED]>

>This reminds me: does anyone have any pointers to information on how
>to convert visually encoded text (especially HTML, but also other
>formats) to Unicode?

There are programs that do it on the fly for Hebrew. The best, which I have 
used myself, is HebTML, available for free downloading from 
http://www.billy.co.il . The author has been working with me on testing a 
new version that supports Unicode. However, I use this app much less than 
before, because Hebrew Internet is rapidly making the transition from visual 
to logical ordering. With IE 5.x and Mozilla supporting logical Hebrew, the 
years-old visual order is on the way out.

The conversion of visual to logical text in BiDi scripts is straightforward 
in principle: check the BiDi property of each character, and reverse each 
run of RTL characters. That means Hebrew letters reverse their order, while 
digits and Latin letters stay the same. Things get more complicated, however, 
when hyphens, paired punctuation and telephone numbers appear. You need a 
smart converter for that.
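
The simple case can be sketched in Python as below.  This is only an
illustration under stated assumptions, not how HebTML works: it treats each
line separately, reverses runs of strong RTL characters (bridging plain
whitespace between them so that word order is restored too), and leaves
everything else alone.  It makes no attempt at the hard cases just mentioned
(hyphens, paired punctuation, telephone numbers).

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes


def visual_to_logical(line: str) -> str:
    """Reverse each run of RTL characters in a visually ordered line."""
    chars = list(line)
    n = len(chars)
    i = 0
    while i < n:
        if unicodedata.bidirectional(chars[i]) not in RTL_CLASSES:
            i += 1
            continue
        last = i  # index of the last strong RTL character in this run
        k = i + 1
        while k < n:
            cls = unicodedata.bidirectional(chars[k])
            if cls in RTL_CLASSES:
                last = k
            elif cls != "WS":  # only whitespace may sit inside a run
                break
            k += 1
        chars[i:last + 1] = chars[i:last + 1][::-1]
        i = last + 1
    return "".join(chars)


if __name__ == "__main__":
    logical = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"
    visual = logical[::-1]  # how a monodirectional LTR platform stores it
    print(visual_to_logical("ABC " + visual + " 123") == "ABC " + logical + " 123")
```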

In essence, visually ordered Hebrew is a kludge for supporting Hebrew on 
platforms that weren't designed for it. In other words, it is an adaptation 
of Hebrew text to monodirectional LTR platforms. On modern platforms the onus 
of handling directionality passes to the software.

--

Shlomi Tal
שלומי טל







Re: UTF-8 to EBCDIC

2002-07-31 Thread Asmus Freytag

See the technical report on UTF-EBCDIC.
Perhaps, that's what's needed?

A./

http://www.unicode.org/reports/tr16

At 05:06 PM 7/31/02 -0700, Magda Danish (Unicode) wrote:


> > -Original Message-
> > From: Vishweshwaraiah, Balasubramanya
> > [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, July 30, 2002 2:52 PM
> > To: Magda Danish (Unicode)
> > Subject: RE: Web Form: General question
> >
> >
> > Magda Danish,
> > Thanks a lot for your interest in helping me by giving suggestions. I
> > visited the site you mentioned in your reply. I didn't get
> > any idea about
> > how to do the conversion from UTF-8 to EBCDIC or I may be
> > thinking in wrong
> > direction.
> >
> > All I understood from that site is the equivalent ebcdic
> > code(cp037) for
> > each Unicode character.
> >
> > If I have a UTF-8 formatted file, how does this table help
> > me to do the
> > conversion?
> >
> > Kindly advise me if I misunderstood anything.
> >
> > Thanks
> > Balu.
> >
> >
> >
> > -Original Message-
> > From: Magda Danish (Unicode) [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, July 25, 2002 3:27 PM
> > To: [EMAIL PROTECTED]
> > Subject: FW: Web Form: General question
> >
> >
> > Have you tried looking at
> > http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/
> >
> > Best Regards,
> >
> > Magda Danish
> > Administrative Director
> > The Unicode Consortium
> > 650-693-3921
> >
> >
> >
> > > -Original Message-
> > > Date/Time:Thu Jul 25 12:51:15 EDT 2002
> > > Contact:  [EMAIL PROTECTED]
> > > Report Type:  General question
> > >
> > > Hi,
> > >
> > > I need to convert UTF-8 format files to EBCDIC. Could you
> > > please suggest any available tool that does this
> > > conversion, or explain how to do it?
> > >
> > > Thanking you in anticipation of a quick response.
> > >
> > > With Best regards
> > > Balu.
> > >
> > > -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> > > (End of Report)
> > >
> > >
> >





Re: [OpenType] library for identifying equivalent sequences

2002-07-31 Thread Peter_Constable


On 07/31/2002 05:46:02 PM Eric Muller wrote:

>Eg. don't you also want the strings that contain a sprinkling of ZWJ,
>ZWNJ, CGJ, SHY and various other things?

(Yuck.) Why, of course. (Bleecchh.) But it's easier to write an algorithm
to insert those than to derive the other. (Gag, choke.) So, I was just
asking for the harder part, but if you've got something to offer that
generates the myriad possibilities involving both... (Retch.)

Ahem. Actually, I, like you, would much rather not have to mess with all
this within fonts -- would much rather have the software / font interface
deal with the equivalencies -- but the current state of our technologies
requires that, if we want our fonts to provide the same display for all of
the different possible sequences that ought to appear the same, then we
have to deal with each and every one. And when you're dealing with a Latin
font and want to support stacking of multiple-diacritic combinations (given
different possible orderings, different possible combinations encoded as
precomposed characters, and all those invisibles), that's a lot of
possibilities.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>







Re: "Missing character" glyph

2002-07-31 Thread Kenneth Whistler

Asmus wrote:

> At 08:40 PM 7/30/02 -0700, Doug Ewell wrote:
> >a code-point that has no
> > > character assigned to it (and is not likely to get one), e. g. U+03A2
> 
> No code point is safe.

True enough. But then I figure Plane 13 characters like
U+DEAD1 are pretty unlikely to be assigned to a character 
in our lifetimes (or our children's lifetimes). 
That one is *reasonably* safe to use as an example. ;-)

--Ken

*remembers when he used to use 0xdeadbeef as a magic
number in tests because it was easy to spot in hex
displays*

> 
> A./




UTF-8 to EBCDIC

2002-07-31 Thread Magda Danish (Unicode)



> -Original Message-
> From: Vishweshwaraiah, Balasubramanya 
> [mailto:[EMAIL PROTECTED]] 
> Sent: Tuesday, July 30, 2002 2:52 PM
> To: Magda Danish (Unicode)
> Subject: RE: Web Form: General question
> 
> 
> Magda Danish,
> Thanks a lot for your interest in helping me by giving suggestions. I
> visited the site you mentioned in your reply. I didn't get 
> any idea about
> how to do the conversion from UTF-8 to EBCDIC or I may be 
> thinking in wrong
> direction.
> 
> All I understood from that site is the equivalent ebcdic 
> code(cp037) for
> each Unicode character. 
> 
> If I have a UTF-8 formatted file, how does this table help
> me to do the
> conversion?
> 
> Kindly advise me if I misunderstood anything.
> 
> Thanks
> Balu. 
> 
> 
> 
> -Original Message-
> From: Magda Danish (Unicode) [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, July 25, 2002 3:27 PM
> To: [EMAIL PROTECTED]
> Subject: FW: Web Form: General question
> 
> 
> Have you tried looking at
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/
> 
> Best Regards,
> 
> Magda Danish
> Administrative Director
> The Unicode Consortium
> 650-693-3921
> 
> 
> 
> > -Original Message-
> > Date/Time:Thu Jul 25 12:51:15 EDT 2002
> > Contact:  [EMAIL PROTECTED]
> > Report Type:  General question
> > 
> > Hi,
> > 
> > I need to convert UTF-8 format files to EBCDIC. Could you
> > please suggest any available tool that does this
> > conversion, or explain how to do it?
> > 
> > Thanking you in anticipation of a quick response.
> > 
> > With Best regards
> > Balu.
> > 
> > -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> > (End of Report)
> > 
> > 
> 




Re: "Missing character" glyph

2002-07-31 Thread Asmus Freytag

At 08:40 PM 7/30/02 -0700, Doug Ewell wrote:
>a code-point that has no
> > character assigned to it (and is not likely to get one), e. g. U+03A2

No code point is safe.

A./




Re: Teletext

2002-07-31 Thread Kenneth Whistler

William Overington suggested:

> I am thinking that it would be a good idea to encode the archive copies of
> teletext pages that exist into a Unicode compatible format for the future.
> Teletext has been around for about a quarter of a century in more or less
> its present form and within another quarter of a century that form might
> well be gone completely.
> 
> I have looked in the Unicode mail list archive and found various items about
> encoding teletext pages using existing Unicode characters.
> 
> I am here suggesting a different approach, a teletext archiving approach.

[snip long details about a PUA-based approach]

> I feel that this encoding will be useful as a stepping stone to a permanent
> regular Unicode encoding of teletext characters for archiving purposes.
> 

> Readers interested in teletext might like to have a look at the following.
> 
> http://teletext.mb21.co.uk
> 
> I am hopeful that by having a specific encoding within Unicode for teletext
> that the archives of teletext pages that exist will be conserved for
> posterity and that an important aspect of social history will be preserved
> for the future.

While it is a laudable goal to aim for conservation of materials for
posterity, there needs to be some judicious selection that goes
into the *art* of archiving appropriate materials.

In the case of teletext, it seems to me that historical "fan" sites
like the one you have cited *are* the appropriate means for archiving
sufficient examples of teletext so that posterity understands not
merely how it was encoded, but sees actual examples of its use
and history, containing both text and blocky graphics in various 
TV markets.

If you are concerned about whether such sites themselves may be
transient, then the appropriate thing to do is to archive the
*sites*, together with their explanations and all their examples
in context.

I see very little value in trying to capture out the text from most
of the teletext materials I have seen into permanent Unicode text
archives. What would be the point? Most of the actual text content
is of a very transient and uninteresting nature:

"The next Telesoftware update will be on MONDAY, November 28.
 In future, telesoftware updating will take place on Mondays
 at fortnightly intervals.
 The day for updating telesoftware is being changed to enable
 us to provide a better, more responsive and more reliable
 service to users. ...
 More in a moment"

Uh, right. This is the kind of "information junk" that we daily
try to filter out of our lives.

So while there is a place for the study of the history of anything,
including teletext, I don't see any particular role that Unicode
has here -- the material is much better represented by using
2D renderings, as shown on the teletext.mb21 site.

--Ken

P.S. Teletext archival examples *are* interesting as a kind of
hint at what the web would be like, if the web were limited to
30x40 fixed-width cell character displays and 60x80 block graphics. ;-)




Re: [OpenType] library for identifying equivalent sequences

2002-07-31 Thread Eric Muller

I don't have what you are looking for [canonically equivalent strings], 
but I am curious how you plan to go from that to:

>(The underlying issue is that I'm trying to figure out, given some
>precomposed glyph in a font, what are all the valid substitutions that
>could be applied in the smart-font code.)
>

Eg. don't you also want the strings that contain a sprinkling of ZWJ, 
ZWNJ, CGJ, SHY and various other things?

Eric.







library for identifying equivalent sequences

2002-07-31 Thread Peter_Constable

I'm wondering if anyone is aware of any software libraries available that
can be used to solve a particular problem: for a given character sequence,
I need to enumerate all of the canonically equivalent character sequences.
Put another equivalent way, given a character sequence in NFD, I need to be
able to enumerate all of the sequences that have the same NFD
representation.
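
To make the problem concrete, here is a brute-force Python sketch using only
the standard `unicodedata` module.  It is an illustration, not an existing
library: the function name is made up, the results depend on the Unicode
version of the local Python build, and the search is exponential, so it is
only usable on short strings.

```python
import unicodedata
from collections import Counter


def canonical_equivalents(text: str) -> set:
    """Enumerate every string whose NFD form equals the NFD form of text."""
    nfd = unicodedata.normalize("NFD", text)
    need = Counter(nfd)
    allowed = set(nfd)

    # Building blocks: every code point whose own NFD form uses only
    # code points that occur in the target NFD string.
    pieces = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogates
            continue
        ch = chr(cp)
        decomp = unicodedata.normalize("NFD", ch)
        if all(c in allowed for c in decomp):
            pieces.append((ch, Counter(decomp)))

    results = set()

    def extend(prefix: str, remaining: Counter) -> None:
        if not remaining:
            # All code points used up: keep the candidate only if
            # canonical reordering really gives the same NFD string.
            if unicodedata.normalize("NFD", prefix) == nfd:
                results.add(prefix)
            return
        for ch, cost in pieces:
            if not (cost - remaining):  # cost fits inside remaining
                extend(prefix + ch, remaining - cost)

    extend("", need)
    return results


if __name__ == "__main__":
    for s in sorted(canonical_equivalents("a\u0323\u0301")):
        print(" ".join(f"U+{ord(c):04X}" for c in s))
```

Every string returned is canonically equivalent to the input by
construction, because membership is decided by comparing NFD forms.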

(The underlying issue is that I'm trying to figure out, given some
precomposed glyph in a font, what are all the valid substitutions that
could be applied in the smart-font code.)



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>







RE: Subscript & Superscript

2002-07-31 Thread Peter_Constable


On 07/31/2002 12:27:46 PM Michael Everson wrote:

>>Unicode is not for encoding typographical effects such as superscripts or
>>subscripts (the sups and subs in area U+2070..U+208E are part of a sort of
>>"archaeological area" of Unicode, which is called Compatibility Characters).
>
>Not quite. Some superscripts and subscripts used in phonetic
>transcriptions are actual letters needed in plain text.

Michael is right; with that exception, Marco was basically right. But let's
keep in mind that the original request had to do with superscripts and
subscripts for mathematical formulas, for which MathML is clearly
appropriate.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>







Re: Teletext

2002-07-31 Thread Lars Marius Garshol


* Shlomi Tal
| 
| 2. Teletext offers no bidirectional algorithm. The display mechanism
| is limited to monodirectional LTR, necessitating the use of visually
| encoded Hebrew (that is, monodirectional LTR written Hebrew; see
| also my Hebrew FAQ for a longer explanation). This needs to be
| inverted to logical order when converting to Unicode.

This reminds me: does anyone have any pointers to information on how
to convert visually encoded text (especially HTML, but also other
formats) to Unicode?

-- 
Lars Marius Garshol, Ontopian         http://www.ontopia.net
ISO SC34/WG3, OASIS GeoLang TC        http://www.garshol.priv.no





Re: Teletext

2002-07-31 Thread Shlomi Tal

Teletext uses VERY old technology encoding in general. I don't know if it's 
true for other languages, but Hebrew teletext encodes the Hebrew letters 
using the 7-bit SI-960, which maps the Hebrew letters instead of the 
lowercase Latin letters (positions 0x60 to 0x7A). In Hebrew teletext you get 
the following outdated practices:

1. 7-bit encoding, which allows only uppercase Latin letters to be used in 
the mixed Hebrew/English mode. Compare Russian KOI-7, Greek ELOT 927, which 
are like Hebrew SI-960 in mapping the non-Latin alphabet on top of the 
lowercase letters.

2. Teletext offers no bidirectional algorithm. The display mechanism is 
limited to monodirectional LTR, necessitating the use of visually encoded 
Hebrew (that is, monodirectional LTR written Hebrew; see also my Hebrew FAQ 
for a longer explanation). This needs to be inverted to logical order when 
converting to Unicode.
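
As a rough Python sketch of the first half of that conversion, the 7-bit
bytes can be mapped to Unicode as below.  It assumes that the 27 positions
0x60 to 0x7A correspond, in order, to the 27 Hebrew letters U+05D0 (alef)
to U+05EA (tav), final forms included, and that every other byte is plain
ASCII.  The result is still in visual order and would need to be inverted
afterwards.

```python
HEBREW_FIRST = 0x05D0   # HEBREW LETTER ALEF
SI960_FIRST = 0x60      # first Hebrew position in the 7-bit code
SI960_LAST = 0x7A       # last Hebrew position in the 7-bit code


def si960_to_unicode(data: bytes) -> str:
    """Decode 7-bit Hebrew bytes to Unicode, keeping the stored order."""
    out = []
    for b in data:
        b &= 0x7F  # ignore any parity bit
        if SI960_FIRST <= b <= SI960_LAST:
            out.append(chr(HEBREW_FIRST + (b - SI960_FIRST)))
        else:
            out.append(chr(b))
    return "".join(out)


if __name__ == "__main__":
    decoded = si960_to_unicode(b"TV \x60\x61\x62")
    print(" ".join(f"U+{ord(c):04X}" for c in decoded))
```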

--

Shlomi Tal
שלומי טל







RE: Subscript & Superscript

2002-07-31 Thread Michael Everson

At 19:05 +0200 2002-07-31, Marco Cimarosti wrote:

>Unicode is not for encoding typographical effects such as superscripts or
>subscripts (the sups and subs in area U+2070..U+208E are part of a sort of
>"archaeological area" of Unicode, which is called Compatibility Characters).

Not quite. Some superscripts and subscripts used in phonetic 
transcriptions are actual letters needed in plain text.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




RE: Subscript & Superscript

2002-07-31 Thread Marco Cimarosti

Magda Danish wrote:
> > -Original Message-
> > Date/Time:Tue Jul 30 12:26:40 EDT 2002
> > Contact:  [EMAIL PROTECTED]
> > Report Type:  FAQ Suggestion
> > 
> > We need to know how to express a Subscript letter in Unicode.
> > On your site, we've found in 2070-208E how to express a
> > Superscript letter or number or a Subscript number, but there
> > is no information about how to write a Subscript letter.
> > We're using the XML Authoring Software Epic developed by
> > Arbortext. We need to be able to express mathematical
> > formulas in XML and we're trying to use Unicode to do it.
> > 
> > -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> > (End of Report)

Unicode is not for encoding typographical effects such as superscripts or
subscripts (the sups and subs in area U+2070..U+208E are part of a sort of
"archaeological area" of Unicode, which is called Compatibility Characters).

To implement superscripts and subscripts in XML, it is enough to write a
two-line cascading style sheet, plus a single line of code to link the CSS
to the XML document. That worked for me anyway.

_ Marco




RE: Tamil Text Messaging in Mobile Phones

2002-07-31 Thread Marco Cimarosti

James Kass wrote:
> Is this a graphic showing the experimental diacritics you mention?
> http://www.geocities.com/avarangal/imagay1.gif
> 
> If so, it should be possible for most of these to be encoded in text
> as pronunciation indicators using existing Unicode characters.
> 
> Glyph  -  Unicode
>  No. Poss.
> 
> [...]
> 0096  U+E6FD **
> [...]
> 
> ** The stroke in Phaistos symbols in ConScript PUA encoding is
>  the closest I could find.

:-)))

C'mon, be serious! That can be mapped to U+0316 (COMBINING GRAVE ACCENT
BELOW).

_ Marco




Re: "Missing character" glyph

2002-07-31 Thread John H. Jenkins


On Tuesday, July 30, 2002, at 08:58 PM, Doug Ewell wrote:

> Have Last Resort symbols been devised for all the blocks in Unicode,
> including the new ones like Tagalog?  Neither Mark Leisher's page nor
> the Apple typography page contains a complete list.
>
>

Yes.  It covers all of Unicode 3.2; but the font has been entirely 
redesigned.  We really need to update our documentation.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: Subscript & Superscript

2002-07-31 Thread Peter_Constable


On 07/31/2002 03:48:07 AM "William Overington" wrote:

>I know little about XML so I do not know whether this suggestion will be a
>suitable solution for the requirement of the person who wrote to the Unicode
>Consortium.

Not at all, I'm afraid. The person who wrote:

>>> -Original Message-
>>> Date/Time:Tue Jul 30 12:26:40 EDT 2002
>>> Contact:  [EMAIL PROTECTED]
>>> Report Type:  FAQ Suggestion
[snip]
>>> We're using the XML Authoring Software Epic developed by
>>> Arbortext. We need to be able to express mathematical
>>> formulas in XML and we're trying to use Unicode to do it.

needs to investigate MathML: http://www.w3.org/Math/



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>







Teletext

2002-07-31 Thread William Overington

In the United Kingdom there is a widely used information system known as
teletext.  It is also used in many other countries.

Teletext is a digital technology used in conjunction with analogue
television systems.  Digital information is inserted in several of the
otherwise unused lines of the television signal within what is known as the
vertical blanking interval of the television picture.

In the United Kingdom the government is to eventually switch off all
analogue television broadcasts, as part of the already started process of
migration to digital television technology.  Thus teletext in its present
form will finish.  There are digital television text and graphics displaying
systems which may continue the teletext name, yet the original teletext
display format is likely to go.

Teletext started in the early 1970s and the currently implemented
specification essentially dates from 1976, (with the exception of the later
fast text linking system).

The government is thinking in terms of turning off the analogue
transmissions sometime between 2006 and 2010.

I am thinking that it would be a good idea to encode the archive copies of
teletext pages that exist into a Unicode compatible format for the future.
Teletext has been around for about a quarter of a century in more or less
its present form and within another quarter of a century that form might
well be gone completely.

I have looked in the Unicode mail list archive and found various items about
encoding teletext pages using existing Unicode characters.

I am here suggesting a different approach, a teletext archiving approach.

I suggest that, in a discussion within this mailing list, a Private Use Area
encoding for archiving teletext pages is agreed, with a view that eventually
it will be put as a proposal for promotion to regular Unicode, probably into
one of the higher planes.

The reason for this approach is that it will permit teletext pages to be
encoded in a plain text file within a document which discusses the
technology.  The teletext characters need to be implemented with the same
width as each other, whereas characters in a discussion document need to be
displayable with possibly different widths one from another.

I suggest, as a starting point for a discussion the following.

U+E200 through to U+E27F for the United Kingdom teletext character set 0x00
to 0x7F.

U+E280 through to U+E2FF to be used to define teletext characters defined in
other countries where those characters are not the same as in the United
Kingdom character set.  This means all of the German accented characters and
so on.  The notes for each encoding to include details of the location
within the 0x00 to 0x7F range where that character was originally encoded
and in which country or countries it was so encoded.

All teletext pages could then be encoded using the above characters.

In addition, the following could be used.

Where a character is to be displayed in contiguous graphics mode, and is a
graphic, not a capital letter push through, the character may be represented
using U+E320 to U+E33F and U+E360 to U+E37F.

Where a character is to be displayed in separated graphics mode, and is a
graphic, not a capital letter push through, the character may be represented
using U+E3A0 to U+E3BF and U+E3E0 to U+E3FF.
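
The suggested allocation can be sketched in Python as follows.  This is only
an illustration of the starting point above: the function and mode names are
made up, and it assumes that the graphics ranges cover the teletext byte
values 0x20 to 0x3F and 0x60 to 0x7F, so that each Private Use Area code
point is a fixed base plus the original 7-bit value.

```python
TEXT_BASE = 0xE200          # U+E200..U+E27F: character set 0x00..0x7F
GRAPHICS_BASE = {
    "contiguous": 0xE300,   # U+E320..U+E33F and U+E360..U+E37F
    "separated": 0xE380,    # U+E3A0..U+E3BF and U+E3E0..U+E3FF
}


def is_graphic_byte(byte: int) -> bool:
    """True for byte values that are graphics in a graphics mode."""
    return 0x20 <= byte <= 0x3F or 0x60 <= byte <= 0x7F


def teletext_to_pua(byte: int, mode: str = "text") -> str:
    """Map one 7-bit teletext value to the proposed Private Use code point."""
    if not 0x00 <= byte <= 0x7F:
        raise ValueError("teletext values are 7-bit")
    if mode == "text" or not is_graphic_byte(byte):
        # Plain text, or a capital letter "pushed through" in graphics mode.
        return chr(TEXT_BASE + byte)
    return chr(GRAPHICS_BASE[mode] + byte)


if __name__ == "__main__":
    for byte, mode in ((0x41, "text"), (0x20, "contiguous"), (0x7F, "separated")):
        print(f"0x{byte:02X} {mode}: U+{ord(teletext_to_pua(byte, mode)):04X}")
```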

This will enable a good idea of the look of a teletext page to be displayed
using an ordinary TrueType font in a wordprocessing document.  Naturally
there is also scope for special teletext displaying programs to be produced
so that graphics with different combinations of foreground and background
colours can be displayed properly.

I feel that this encoding will be useful as a stepping stone to a permanent
regular Unicode encoding of teletext characters for archiving purposes.

Hopefully this initiative will encourage people to get out any old 5 1/4
inch floppy discs that they may have and transfer any teletext pages saved
upon them into an archived form.

Readers interested in teletext might like to have a look at the following.

http://teletext.mb21.co.uk

I am hopeful that by having a specific encoding within Unicode for teletext
that the archives of teletext pages that exist will be conserved for
posterity and that an important aspect of social history will be preserved
for the future.

Does anyone know if the early graphic art from Oracle (Oracle being the name
of the then ITV teletext service as well as of the technology, being an
acronym for Optional Reception of Announcements by Coded Line Electronics)
in the mid 1970s has survived?

Also, does anyone archive Viewdata pages?  Viewdata was not a broadcasting
technology but provided pages with a compatible display format to teletext
which pages could be accessed over a telephone line connection.

William Overington

31 July 2002












Re: Subscript & Superscript

2002-07-31 Thread William Overington

Some time ago in this list, Mr Bernard Miller posted a note about his Bytext
system.

If one goes to http://www.bytext.org and then goes through to the
documentation page at http://www.bytext.org/documentation.htm one may
download a copy of the latest edition of The Bytext Standard.  I chose to
download the pdf file which is 606 kilobytes.

On pages 34 and 35 of that document are details of arrow parentheses
invented by Mr Miller.

On page 72 is a statement concerning intellectual property rights.

I feel that it would be very useful if these eight arrow parenthesis
characters are used in a Unicode compatible environment.

As some readers may know I have been researching on my courtyard codes
system.

http://www.users.globalnet.co.uk/~ngo/court000.htm

Courtyard codes are placed within the Private Use Area of Unicode.  The
above document is indexed from an index page about some of my other uses
of the Private Use Area.

http://www.users.globalnet.co.uk/~ngo/golden.htm

It occurs to me that if the eight arrow parenthesis characters were encoded
into my courtyard codes system, then that would be potentially of great
usefulness.

I am thinking in terms of U+F388 through to U+F38F being used for this
purpose, with the codes being assigned to the arrow parentheses in the order
in which Mr Miller lists them in The Bytext Standard.

If this happens then the way to express a subscript uppercase A character
would be as follows.

U+F38A U+0041 U+F38B

The U+0041 is the code for A in regular Unicode, so immediately there is a
general method for subscripting any Unicode character.  Indeed subscripts of
subscripts could be used by nesting the arrow parentheses.

For example, a subscript A subscript B could be expressed as follows.

U+0061 U+F38A U+0041 U+F38A U+0042 U+F38B U+F38B

The U+0061 is the code for a in regular Unicode and the U+0042 is the code
for B in regular Unicode.
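
The two sequences above can be generated with a short Python sketch.  The
code points U+F38A and U+F38B are only the Private Use Area assignments
suggested in this posting, and the helper names are made up for the example.

```python
SUB_OPEN = "\uf38a"    # suggested opening arrow parenthesis for a subscript
SUB_CLOSE = "\uf38b"   # suggested matching closing arrow parenthesis


def subscript(text: str) -> str:
    """Wrap text in the subscript pair; calls may be nested."""
    return SUB_OPEN + text + SUB_CLOSE


def code_points(text: str) -> str:
    """Show a string as a list of U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    # Subscript uppercase A.
    print(code_points(subscript("A")))
    # Lowercase a, subscript A, with B as a subscript of that subscript.
    print(code_points("a" + subscript("A" + subscript("B"))))
```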

Arrow parentheses allow a mathematical expression involving superscripts,
subscripts, integral limits, summation limits and various other items to be
expressed in a linear manner, which makes those expressions able to be
stored in a Unicode file in what is essentially a plain text storage format,
though I mention that this will not be plain text as such as it involves the
use of code points for what might be considered markup.

I know little about XML so I do not know whether this suggestion will be a
suitable solution for the requirement of the person who wrote to the Unicode
Consortium.

However, perhaps it will be a helpful suggestion.

Certainly using the codings which I suggest would involve use of code points
from the Private Use Area.  However, as the need is now, then even if the
arrow parenthesis characters are one day promoted to regular Unicode, the
use of Private Use Area characters now may be what is needed to achieve the
desired result.

By placing these code point ideas into this posting to the Unicode mail
list, they will be archived in the archives of the Unicode mail list and
also sent to many people interested in Unicode around the world.  So,
although they are only Private Use Area encodings, it is possible that these
encodings will be noted in many places by many people.  It is simply
speculation as to whether few or many people will choose to recognize such
code point allocations for their own uses.

The use of these code points would raise the question as to how a string
containing them should be displayed.  The idea is that in a plain text
editor mode, the arrow parenthesis characters would be displayed with the
glyphs shown by Mr Miller in The Bytext Standard.  In a graphical display,
the arrow parenthesis characters would not be displayed, yet would influence
how characters included between matching pairs of arrow parenthesis
characters are displayed.  This is no more complicated in principle than
viewing an HTML page in Internet Explorer then viewing the source code of
the HTML page in Notepad then going back to the Internet Explorer display.

Whether any font makers would add glyphs for the eight arrow parenthesis
characters into the code positions U+F388 through to U+F38F remains to be
seen, though I am cautiously optimistic in the matter.  Also the possibility
exists for the person who originally wrote to the Unicode Consortium to have
his or her own font produced in addition to any font maker making such a
font available.
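
For anyone wishing to check that the suggested code positions really do lie
in the Private Use Area, the following short Python sketch may help.  It
relies only on the standard unicodedata module; the names ARROW_PARENS and
is_private_use are made up for this example.

```python
import unicodedata

# The eight code positions suggested in this posting.
ARROW_PARENS = range(0xF388, 0xF390)   # U+F388 .. U+F38F inclusive


def is_private_use(cp):
    """True if the code point has general category Co (private use)."""
    return unicodedata.category(chr(cp)) == "Co"


def in_bmp_pua(cp):
    """True if the code point lies in the BMP Private Use Area."""
    return 0xE000 <= cp <= 0xF8FF


for cp in ARROW_PARENS:
    print(f"U+{cp:04X}", is_private_use(cp), in_bmp_pua(cp))
```

Both checks report that all eight positions are private use, so no
conforming process will assign them a standard meaning.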

William Overington

31 July 2002

-Original Message-
From: Magda Danish (Unicode) <[EMAIL PROTECTED]>
To: unicode <[EMAIL PROTECTED]>
Date: Tuesday, July 30, 2002 8:46 PM
Subject: Subscript & Superscript


>
>> -Original Message-
>> Date/Time:Tue Jul 30 12:26:40 EDT 2002
>> Contact:  [EMAIL PROTECTED]
>> Report Type:  FAQ Suggestion
>>
>> We need to know how to express a Subscript letter in Unicode.
>> On your site, we've found in 2070-208E how to express a
>> Superscript letter or number or a Subscript number, but there
>> is no information about how to write a Subs

Re: "Missing character" glyph

2002-07-31 Thread Michael Everson

At 17:15 -0400 2002-07-30, Tom Gewecke wrote:

>  >Apple's Last Resort font. :-)
>
>Which I believe uses the various symbols shown at
>
>http://www.unicode.org/charts/
>
>so you can easily tell from which code range your font is missing the
>character.

I think those glyphs are from the older version of the Last Resort font.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: quotation marks in European languages

2002-07-31 Thread Otto Stolz

Scripsissem: 
> The correct quote symbols, according to the German typographic
> tradition, are
...


John Cowan scripsit:
> Does not German also support the quotation dash for dialogue?


Not really.

You may use the dash to indicate change of the speaker,

cf. .

Now, if you apply this scheme and drop the quote symbols entirely,
there is nothing but dashes to delimit the utterances -- which
may lead to the wrong perception that these dashes are used as
quote symbols. However, these dashes separate the contributions
of the various speakers, whilst quote symbols would enclose them.

Best wishes,
   Otto Stolz





Re: Tamil Text Messaging in Mobile Phones

2002-07-31 Thread James Kass


Dear Sinnathurai Srivas,

Is this a graphic showing the experimental diacritics you mention?
http://www.geocities.com/avarangal/imagay1.gif

If so, it should be possible for most of these to be encoded in text
as pronunciation indicators using existing Unicode characters.

Glyph No.  Possible Unicode

0152  U+0309 *
0094  U+0302
0153  U+0303
0154  U+0308
0155  U+0304
0156  U+031A
0134  U+0325
0096  U+E6FD **
0135  U+0339
0136  U+02E9 ***
0137  U+
0138  U+2321 ***
0139  U+2218 ***

*   The reference glyph in the standard is reversed.  But the
    reference glyphs are only informative; the actual glyph
    shapes are up to the font developer.

**  The stroke in Phaistos symbols in ConScript PUA encoding is
    the closest I could find.

*** These characters were selected only for their appearance.

With the above in mind, here's an attempt to encode part of the
examples in the graphic linked above in Unicode (UTF-8):

அ̉ அ̂ அ̃ அ̈ அ̄ அ̚
அ̄̉
க̥ க க̹ க˩ க? க⌡ க∘

Admittedly, the display of the above here is less than optimal,
but this is a font/display issue rather than an encoding issue.
(At least no dotted circles are appearing here in the display.)
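
For anyone who wants to reproduce such sequences, here is a minimal Python
sketch.  It assumes the base letters are U+0B85 TAMIL LETTER A and U+0B95
TAMIL LETTER KA and takes the combining marks from the table above; the
helper names with_marks and describe are invented for this example, and the
order of the two stacked marks is a guess.

```python
import unicodedata

TAMIL_A = "\u0b85"    # TAMIL LETTER A
TAMIL_KA = "\u0b95"   # TAMIL LETTER KA

# Combining marks from the table: hook above, circumflex, tilde,
# diaeresis, macron, left angle above.
MARKS_ABOVE = ["\u0309", "\u0302", "\u0303", "\u0308", "\u0304", "\u031a"]


def with_marks(base, *marks):
    """A base letter followed by its combining marks, in logical order."""
    return base + "".join(marks)


def describe(text):
    """Print each code point with its name, then the UTF-8 bytes."""
    for ch in text:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
    print("UTF-8:", " ".join(f"{b:02X}" for b in text.encode("utf-8")))


first_row = [with_marks(TAMIL_A, m) for m in MARKS_ABOVE]
# Two marks stacked on one base (the order used here is an assumption).
stacked = with_marks(TAMIL_A, "\u0304", "\u0309")
ring_below = with_marks(TAMIL_KA, "\u0325")   # combining ring below

describe(first_row[0])
```

The first sequence comes out as the bytes E0 AE 85 CC 89: three bytes for
the Tamil letter followed by two for the combining hook above.  Whether the
marks are positioned well is then up to the font and the renderer.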

As Peter Constable wrote recently in reply to Keld Jørn Simonsen:
>>My point has been that that language community would be much
>>better served by dropping the idea of using "@" in this way and picking
>>something else since, as suggested in your comment, 10646 has lots to
>>choose from.

And, this is a good point.  There are many existing characters in
Unicode from which to choose, not only for orthographies, but
even for pronunciation symbols.

William Overington and Martin Kochanski independently suggested
that the Private Use Area would be well-suited for any experimental
characters, in case some forms can't already be found, or existing
Unicode forms are not acceptable.  The PUA remains an option for
the pronouncing glyphs.

Font replacement by the system sometimes solves problems and
sometimes makes problems.  Arbitrary font-switching is not an
encoding issue, but one way to avoid it is to use applications
which either don't do it, or allow the user to disable it.

Best regards,

James Kass.

- Original Message -
From: "Sinnathurai Srivas" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, July 30, 2002 7:44 AM
Subject: Re: Tamil Text Messaging in Mobile Phones


Dear James Kass,

For a pronunciation dictionary, a set of experimental diacritics needs to
be included

and

when these additional diacritics occur in text, the OS should not decide that
something is wrong with the grammar and substitute dotted circles, or assume
that the font is faulty and replace it with another font which does not know
anything about the additional diacritics used.

Sinnathurai Srivas