Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-19 Thread Peter_Constable

On 08/17/2002 09:29:00 AM William Overington wrote:

Peter Constable wrote as follows.

The standard already specifies that FFFC should not be exported from an
application or interchanged.

As far as I am aware that is not presently the case.

If you still say that that is correct, could you please state the exact text
of the standard relating to this matter and where in the standard that text
can be found please?

OK, it doesn't say it explicitly; nevertheless, I believe I know what the
intent of the text is, and that it is not condoning interchange of FFFC.
The fact that the text isn't more explicit is something that could perhaps
be improved; but if you think about what the text on pp 326-7 *does* say,
I think this intent can be detected. It seems clear to me that it assumes
usage within the context of some higher-level protocol, such as would be
imposed by a software process. For instance, the text makes reference to
"the object's formatting information", but Unicode / plain text does not
provide representation for such information. Thus, there necessarily must
be some other protocol at work within which that information is
represented. FFFC, then, is something that is utilised by that
higher-level protocol. Hence, this section of the Standard is *not*
talking about FFFC being used in interchanged plain text. It is, rather,
assuming usage internal to some processing context or other higher-level
protocol.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]



Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-19 Thread Peter_Constable

On 08/16/2002 04:58:58 PM William Overington wrote:

The DVB-MHP (Digital Video Broadcasting - Multimedia Home Platform) system
(details at http://www.mhp.org ) which implements my telesoftware invention.
A Java program which has been broadcast can read a Unicode plain text file
and act upon the characters within it, and can read other file formats, such
as .png files (Portable Network Graphics) and act upon the information in
those files, so as to produce a display.

So, a collection of files, namely a .uof file in the format that I suggested
it, a Unicode plain text file with one or more U+FFFC characters in it and
the appropriate graphics files in .png format as a package of free to the
end user distance education learning material being broadcast from a direct
broadcasting satellite or a terrestrial transmitter could be a very useful
facility as the way to carry text with illustrations.

I'd suggest that it would be far more useful to use a marked-up file
format based on XML. It doesn't have to be verbose (besides which, the
bandwidth requirements of embedded graphics will be far greater than any
requirements for markup used to indicate their position within the text).
The reason I think this would be far more advantageous is that there has
been a massive interest throughout the IT industry in XML, meaning that
there are lots of software implementations that support it, and it is very
easy to build processes for publishing content. You could probably use
any commonly-used database product out there to generate XML content
suited for DVB-MHP; in fact, it would be easy to take some existing
XML-based publishing process and extend it to support an XML-based file
format specifically intended for DVB-MHP. In contrast, if you want to
invent a new file format, then you've got to create new software
implementations to go with it, and bolting that into any existing
publishing process will be far more costly.



Using HTML and a browser is just not the way to proceed in that situation.
HTML and a browser is a very useful technique for the web and indeed is an
option for the DVB-MHP system, yet the basic software system is Java based.

Markup does not have to imply HTML and a Web browser. I'm sure you'd find
a lot of Java implementations that made use of XML-based file formats, and
though I'm not a Java programmer, I'm certain that you can find good
support for parsing or generating XML streams in Java.
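
A minimal sketch of the kind of thing being suggested, assuming a
hypothetical XML format in which each illustration is marked by an
<object src="..."/> element (the element and attribute names are invented
for illustration only); the JAXP parser bundled with Java can pull out the
object references in a few lines:

    // Sketch only. It assumes a hypothetical file such as:
    //   <story><p>Some text <object src="horse.png"/> more text.</p></story>
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class StoryReader {
        public static void main(String[] args) throws Exception {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(args[0]);   // e.g. story7.xml

            // List every embedded object reference, in document order.
            NodeList objects = doc.getElementsByTagName("object");
            for (int i = 0; i < objects.getLength(); i++) {
                Element obj = (Element) objects.item(i);
                System.out.println("object " + (i + 1) + ": "
                        + obj.getAttribute("src"));
            }
        }
    }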



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]



Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread William Overington

Kenneth Whistler wrote as follows about my idea.

 It occurs to me that it is possible to introduce a convention, either as a
 matter included in the Unicode specification, or as just a known about
 thing, that if one has a plain text Unicode file with a file name that has
 some particular extension (any ideas for something like .uof for Unicode
 object file)

...or to pick an extension, more or less at random, say .html

Well, that could produce confusion with a .html file used for Hyper Text
Markup Language, HTML.

I suggested .uof so that a .uof file would be known as being for this
purpose.


 that accompanies another plain text Unicode file which has a file name
 extension such as .txt, or indeed other choices except .uof (or whatever is
 chosen after discussion) then the convention could be that the .uof file has
 on lines of text, in order, the name of the text file then the names of the
 files which contains each object to which a U+FFFC character provides the
 anchor.

 For example, a file with a name such as story7.uof might have the following
 lines of text as its contents.

 story7.txt
 horse.gif
 dog.gif
 painting.jpg

This is a shaggy dog story, right?

No, it is a story about an artist who wanted to paint a picture of a horse
and a picture of a dog and, since he knew that the horse and the dog were
great friends and liked to be together and also that he only had one canvas
upon which to paint, the artist painted a picture of a landscape with the
horse and the dog in the foreground, thereby, as the saying goes, painting
two birds on one canvas, http://www.users.globalnet.co.uk/~ngo/bird0001.htm
in that he achieved two results by one activity.  In addition the picture
has various interesting details in the background, such as a windmill in a
plain (or is that a windmill in a plain text file).  :-)

 The file story7.uof could thus be used with a file named story.txt so as to
 indicate which objects were intended to be used for three uses of U+FFFC in
 the file story7.txt, in the order in which they are to be used.

Or we could go even further, and specify that in the story7.html file,
the three uses of those objects could be introduced with a very specific
syntax that would not only indicate the order that they occur in, but
could indicate the *exact* location one could obtain the objects -- either
on one's own machine or even anywhere around the world via the Internet!
And we could even include a mechanism for specifying the exact size that
the object should be displayed. For example, we could use something like:

<img src="http://www.coteindustries.com/dogs/images/dogs4.jpg" width=380
 height=260 border=1>

or

<img src="http://www.artofeurope.com/velasquez/vel2.jpg">

Now that is a good idea.  In a .uof file specifically for the purpose, a
line beginning with a < character could be used to indicate a web based
reference, or a local reference, for the object, using exactly the same
format as is used in an HTML file.

If the line does not start with a < character, then it is simply a file name
in the same directory as the .uof file, as I suggested originally.  This
would mean that where, say, a .uof file were broadcast upon a telesoftware
service, the Java program (also broadcast) analysing the file names in
the .uof file need not necessarily be able to decode lines starting with a <
character, so that the Java program does not need to have the software for
that decoding in it, yet the same .uof file specification could be used,
both in a telesoftware service and on the web, where a more comprehensive
method of referencing objects were needed.
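
A minimal sketch of a reader for the convention just described, assuming
only what is stated above (first line names the text file, following lines
name the objects, a line beginning with a < character is an HTML-style
reference); it is one possible reading, not a specification:

    // Rough sketch only: first line of the .uof file = text file,
    // following lines = objects, lines beginning with '<' = HTML-style
    // references. Not a specification.
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class UofReader {
        public static void main(String[] args) throws IOException {
            List<String> uof = Files.readAllLines(
                Paths.get("story7.uof"), StandardCharsets.UTF_8);
            String text = new String(Files.readAllBytes(
                Paths.get(uof.get(0))), StandardCharsets.UTF_8);

            int objectIndex = 1;           // uof.get(0) is the text file itself
            for (int i = 0; i < text.length(); i++) {
                if (text.charAt(i) == '\uFFFC' && objectIndex < uof.size()) {
                    String entry = uof.get(objectIndex++);
                    if (entry.startsWith("<")) {
                        System.out.println("anchor " + i
                            + " -> HTML-style reference: " + entry);
                    } else {
                        System.out.println("anchor " + i
                            + " -> local file: " + entry);
                    }
                }
            }
        }
    }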

 I can imagine that such a widely used practice might be helpful in bridging
 the gap between being able to use a plain text file or maybe having to use
 some expensive wordprocessing package.

And maybe someone will write cheaper software -- we could call it a browser --
that could even be distributed for free, so that people could make use of
this convention for viewing objects correctly distributed with respect to
the text they are embedded in.

Indeed, except not call it a browser as the name is already in widespread
use for HTML browsers and might cause confusion.  Analysing a .uof file
would be a much less computational task than analysing the complete syntax
of HTML files.

Yes, yes, I think this is an idea which could fly.

--Ken


Good.  It is a solution which could be very useful for people writing
programs in Java, Pascal and C and so on which programs take in plain text
files and process them for such purposes as producing a desktop publishing
package.

Hopefully the Unicode Technical Committee will be pleased to add a .uof
format file specification into the set of Unicode documents so that the
U+FFFC code can be used in an effective manner.  The idea could be that if a
.uof file is processed then the rules of .uof files apply in that situation,
so that if a .uof file is not being processed, then the rules for .uof files
do not apply, therefore 

Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread Barry Caplan



Yes, yes, I think this is an idea which could fly.

--Ken


Good.  It is a solution which could be very useful for people writing
programs in Java, Pascal and C and so on which programs take in plain text
files and process them for such purposes as producing a desktop publishing
package.


Uhh, I think Ken's message was entirely sarcasm or some higher form of rhetorical 
humor whose obscure name slips my mind right now.

The suggestion to use html as an extension was the giveaway - I was laughing out
loud from that point on - his point was that the technology to do what you want
already exists: it is called HTML and it is displayed by browsers and so forth.

Barry Caplan
www.i18n.com





Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread James Kass


William Overington wrote,

 
 No, it is a story about an artist who wanted to paint a picture of a horse
 and a picture of a dog and, since he knew that the horse and the dog were
 great friends and liked to be together and also that he only had one canvas
 upon which to paint, the artist painted a picture of a landscape with the
 horse and the dog in the foreground, thereby, as the saying goes, painting
 two birds on one canvas, http://www.users.globalnet.co.uk/~ngo/bird0001.htm
 in that he achieved two results by one activity.  In addition the picture
 has various interesting details in the background, such as a windmill in a
 plain (or is that a windmill in a plain text file).  :-)
 

1)  It's gif file format rather than plain text.*
2)  There isn't any windmill.

Best regards,

James Kass,

* P.S. - But, it's a nice gif file.  In fact, aside from the absence of
the windmill, it exceeded my expectations.  -JK.








Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread Tex Texin

William,

So let me see if I understand this correctly.

Let's take 2 perfectly good standards, Unicode and HTML, and make some
very minor tweaks to them, such as changing the meaning of U+FFFC and a
special format for filenames in the beginning of the file and a new
extension, so we have something new.

Now the big benefit of this completely new thing, is that programs that
do desktop publishing can use plain text files which are not quite plain
text because they have some special formatting, but now they can publish
them in better manner than before. For example, plain text with
pictures. This is great. (It is true that it is less capable than if we
had just used enough html to do the same thing, but .uof is more like
plain text than html is.) Programmers will be happy because now they can
support plain text with just a few tweaks. Oh I almost forgot, they also
have to support Unicode, but slightly tweaked. And they can also support
HTML, with some minor tweaks for .uof. Of course programmers don't mind
supporting lots of variations of the same thing. Customer support
personnel also don't mind.
Oh, the plain text programmers will now need to support pictures and
other aspects of full publishing, but at least they won't have a complex
file format to work with. I guess it doesn't matter that a more complex
format is also more expressive and therefore can leverage all of the
publishing features. It probably doesn't matter that a desktop
publishing product probably already supports more complex formats, and
probably also supports html, it will be beneficial to add this slight
difference from plain text.

I like this very much. It is very much like when the magician slides the
knot in the string and makes it disappear.

I imagine that over time we will have some more wonderful inventions and
add further tweaks and further improve the publishing of plain text.

There are a few other things I would like to improve in Unicode, so I
hope it will be ok to make some other suggestions. We can change the
extension to know which tweaks we are talking about: .uo1, .uo2. Just a
few small changes to characters and plain text format variations.
Stability of the meaning of the file isn't important.

However, I think my first suggestion will be to make the benefits of
.uof available to XML. We can call this .uo1.

I am a little disconcerted that html already can do everything that .uof
does plus more, and is also supported by all of the publishers that are
likely to support .uof. Also, as there are more than a million characters
in Unicode, most are unused so far, so changing the meaning of just FFFC
in this one context doesn't seem like a big win, considering also every
line of code that might work with FFFC now needs to consider the context
to determine its semantics.
But every invention deserves to be implemented; we need not look at
whether the invention satisfies some demand of its customers.

I like the 2 birds picture and I assume it was a metaphor for the idea -
one bird was html, the other unicode. I was a little disappointed that
you used html instead of .uof format though.

Maybe it's the lateness of the hour here. I hope the idea looks as good
in the morning.

Oh I almost forgot. I was having difficulty discerning when you and Ken
might be joking. The mails read very serious. I would like to suggest we
make a new format .uo2. We can indicate line numbers and emotions with
plain text characters that look like facial expressions. It would help
me know when you both were serious and when you might be joking.
Sometimes it is hard to tell. I am going to create a list of facial
expressions and assign them in the PUA so we can all have a standard to
follow. See my next mail with a list of facial expressions and
assignments.
tex



William Overington wrote:
 
 Kenneth Whistler wrote as follows about my idea.
 
  It occurs to me that it is possible to introduce a convention, either as
 a
  matter included in the Unicode specification, or as just a known about
  thing, that if one has a plain text Unicode file with a file name that
 has
  some particular extension (any ideas for something like .uof for Unicode
  object file)
 
 ...or to pick an extension, more or less at random, say .html
 
 Well, that could produce confusion with a .html file used for Hyper Text
 Markup Language, HTML.
 
 I suggested .uof so that a .uof file would be known as being for this
 purpose.
 
 
  that accompanies another plain text Unicode file which has a
  file name extension such as .txt, or indeed other choices except .uof (or
  whatever is chosen after discussion) then the convention could be that
 the
  .uof file has on lines of text, in order, the name of the text file then
 the
  names of the files which contains each object to which a U+FFFC character
  provides the anchor.
 
  For example, a file with a name such as story7.uof might have the
 following
  lines of text as its contents.
 
  story7.txt
  horse.gif
  dog.gif
  painting.jpg
 

Re: Furigana

2002-08-16 Thread Peter_Constable

On 08/14/2002 05:53:58 AM James Kass wrote:

Once a meaning like
INTERLINEAR ANNOTATION ANCHOR has been assigned to
a code point, any application which chooses to use that code
point for any other purpose would be at fault.

Since it's for internal use only, nobody would ever know. Unicode 
conformance must always be understood in terms of what happens externally, 
between two processes, or between a process and a user. What goes on 
inside doesn't matter as long as it is conformant on the outside. If my 
program includes a portion of code that interprets all USVs as jelly-bean 
flavours but doesn't let any symptoms of that leak outside, I haven't 
violated any conformance requirement.



In other words, if these characters are to be used internally for
Japanese Ruby (furigana), etc., then they ought to be able to
be used externally, as well.

They simply aren't adequate for anything more than the simplest of cases. 
Moreover, the recommendations of TR#20 / the W3C character model clearly
indicate that markup is to be preferred for applications like this.



Because it seems to be an oxymoron.

I think most would agree that that's clear now, but it wasn't always 
understood so clearly.


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





RE: Furigana

2002-08-16 Thread Peter_Constable

On 08/14/2002 10:52:32 AM Michael Everson wrote:

I'm saying I WANT to use these characters. They solve an apparent
need of mine

They only *appear* to you to solve that need, but in fact do not offer a 
good solution. Markup is recommended for your need.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread Peter_Constable

On 08/14/2002 02:04:50 PM William Overington wrote:

As this concerns the U+FFFC character and the Unicode Technical Committee is
due to meet next week, I think it might be helpful if this idea is discussed
before the meeting as a straightforward idea like this might mean that the
possibility to exchange U+FFFC characters at all if people want to do so is
not lost.

This does not solve any problems not already solved. This is not plain 
text; it is a form of interchange markup and a higher-level protocol. 
There are already higher-level markup protocols that accomplish this. The 
standard already specifies that FFFC should not be exported from an 
application or interchanged. There is no reason to change this.


Everybody will welcome the new conventional, graphical-type characters
and scripts that are coming with Unicode 4.0.

What are those please?

See the Proposed characters section of the Unicode site.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: RE: Furigana

2002-08-16 Thread Peter_Constable

On 08/14/2002 01:16:29 AM starner wrote:

That seems to be basically what William Overington is proposing,
except these characters only handle furigana, instead of all markup.

Not quite. WO has proposed characters to be used in interchange. These are 
only intended for internal use by programmers. They are exactly like the 
non-characters at FDD0..FDEF except that these were named to a specific 
function (as was FFFC -- also an internal-use code with a 
specifically-named function).



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: Furigana

2002-08-16 Thread Tex Texin



[EMAIL PROTECTED] wrote:
 
 On 08/14/2002 12:45:22 AM Kenneth Whistler wrote:
 
 But even at the time, as the record of the deliberations would
 show, if we had a more perfect record, the proponents were clear
 that the interlinear annotation characters were to solve an
 internal anchor point representation problem.
 
 I recall at the UTC meeting in Jan 2000 (I think it was 2000) there was
 discussion of adding non-character code points for internal use by
 programmers, and I remember Tex suggesting that it might be better to
 identify the specific functions for which internal-use codepoints might be
 needed, as had been done in the case of things like the IA characters. In
 other words, at that time, it seems that they were understood by everyone
 present to be intended for internal use by programmers only.

Peter's made the point that "for internal use" was understood, which is
fine.

Let me add, that my concern with internal-use code points not having
specific functions, is that we now live in a world where software
applications often use third party components (various drivers, shared
libraries, OCXs, DLLs, etc.) internally. Having internal-use code
points, which may not be treated with the right semantics by 3rd
parties that have been integrated with internally, is problematic. You
should be careful and avoid passing these internal-use code points to
third parties, but this greatly inhibits their use, or makes for an
awkward and not easily extensible architecture.

At the time (in the discussion), I don't think we had many examples of
what the uses would be, and it wasn't clear that many were needed, since
the functionality could be arrived at with higher level protocols.

So to be clear, when internal-use code points are used, not only do they
need to be filtered from external exchanges, you need to be very clear
about your internal architecture and make sure you don't call a system
function or third party function that might mistreat the internal-use code
point or, worse, barf at it.
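
A minimal sketch of that kind of filtering, assuming an application that
uses FFF9..FFFC and the FDD0..FDEF noncharacters internally and wants to
strip them before calling out to third party code:

    // Illustrative sketch: strip internal-use code points (FFF9..FFFC here,
    // plus the FDD0..FDEF noncharacters) before passing text to code that
    // was not written with those internal conventions in mind.
    public class InternalUseFilter {
        public static String stripInternalUse(String s) {
            StringBuilder out = new StringBuilder(s.length());
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                boolean internal = (c >= '\uFFF9' && c <= '\uFFFC')
                                || (c >= '\uFDD0' && c <= '\uFDEF');
                if (!internal) {
                    out.append(c);
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // U+FFF9 annotated-text U+FFFA annotation U+FFFB
            String internal =
                "\uFFF9Temperature\uFFFAa measure of hotness\uFFFB varies.";
            System.out.println(stripInternalUse(internal));
            // prints: Temperaturea measure of hotness varies.
        }
    }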

(Anyway, I think that's what I was thinking at the time. I have trouble
remembering what I said yesterday much less the last millennium.)
tex

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCrafthttp://www.XenCraft.com
Making e-Business Work Around the World
-




Re: The existing rules for U+FFF9 through to U+FFFC. (spins from Re: Furigana)

2002-08-16 Thread William Overington

Kenneth Whistler replied to my posting as follows.

 An interesting point for consideration is as to whether the following
 sequence is permitted in interchanged documents.

 U+FFF9 U+FFFC U+FFFA Temperature variation with time. U+FFFB

 That is, the annotated text is an object replacement character and the
 annotation is a caption for a graphic.

Yes, permitted.

Great.  That may well be useful for free to the end user distance education
using telesoftware upon digital television channels.  A .uof file (as in the
thread "An idea for keeping U+FFFC usable") could be used with a Unicode
plain text file of some learning material over the broadcast link and a Java
program (also broadcast) could place the pictures with their captions in the
correct place in the text.

As would also be:

U+FFF9 U+FFFC U+FFFC U+FFFA U+FFF9 Temperature U+FFFA a measure of
hotness, related to the U+FFF9 kinetic energy U+FFFA energy of motion
U+FFFB
of molecules of a substance U+FFFB U+FFF9 variation U+FFFA rate of change
U+FFFB with time U+FFFC . U+FFFB

Where the first U+FFFC is associated with a URL with a realtime data feed,
the second U+FFFC is a jar file for a 3-dimensional dynamic display
algorithm,
and the third U+FFFC is a banner ad for Swatch watches.


Thank you for this example.  I have analysed it thoroughly using Notepad by
going to a new line and indenting at each occurrence of U+FFF9 and going to
a new line and indenting at each occurrence of U+FFFA, and going to a new
line and placing each U+FFFB beneath the corresponding U+FFFA.  For each
U+FFFC I went to a new line, and placed the U+FFFC beneath the most recent
U+FFF9 or U+FFFA character.

In addition, after each U+FFF. character, for ordinary text, I went to a
new line and indented so that the next ordinary text character was beneath
the U of the most recently entered U+FFF. character, except that after a
U+FFFB the indentation went back two indentation levels.

After each U+FFFC character, and on the same line, I added the details of
the object within parentheses.

This gave the following.

U+FFF9
U+FFFC (URL with a realtime data feed)
U+FFFC (jar file for a 3-dimensional dynamic display algorithm)
U+FFFA
U+FFF9
Temperature
U+FFFA
a measure of hotness, related to the
U+FFF9
kinetic energy
U+FFFA
energy of motion
U+FFFB
of molecules of a substance
U+FFFB
U+FFF9
variation
U+FFFA
rate of change
U+FFFB
with time
U+FFFC (banner ad for Swatch watches)
.
U+FFFB

This took me quite some time to figure out, and was indeed an interesting
challenge.
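
A minimal sketch of the same exercise done mechanically, assuming nothing
beyond the layering of the annotation characters themselves; it prints one
line per control character or run of ordinary text, indented by the current
nesting depth:

    // Sketch only: prints an indented outline of a string containing the
    // interlinear annotation characters, one line per control character or
    // run of ordinary text, indented by the current nesting depth.
    public class AnnotationOutline {
        public static void outline(String s) {
            int depth = 0;
            StringBuilder run = new StringBuilder();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c == '\uFFF9' || c == '\uFFFA' || c == '\uFFFB') {
                    flush(run, depth);
                    if (c != '\uFFF9') depth--;   // leave the part that just ended
                    print(depth, "U+" + Integer.toHexString(c).toUpperCase());
                    if (c != '\uFFFB') depth++;   // enter annotated text or annotation
                } else if (c == '\uFFFC') {
                    flush(run, depth);
                    print(depth, "U+FFFC (object kept outside the data stream)");
                } else {
                    run.append(c);
                }
            }
            flush(run, depth);
        }

        private static void flush(StringBuilder run, int depth) {
            String text = run.toString().trim();
            if (text.length() > 0) print(depth, text);
            run.setLength(0);
        }

        private static void print(int depth, String text) {
            for (int i = 0; i < depth; i++) System.out.print("    ");
            System.out.println(text);
        }

        public static void main(String[] args) {
            outline("\uFFF9Temperature\uFFFAa measure of hotness\uFFFB varies.");
        }
    }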

 It seems to me that if that is indeed permissible that it could potentially
 be a useful facility.

I was referring to my original example, not to your example!  :-)


Permissible does not imply useful, however, in this case.

That's referring to your example when you refer to "this case", is it?  :-)

It is
unlikely that you are going to have access to software that would
unscramble such layering in purported plain text, even if you
had agreements with your receivers.

Hmm?  Yet, it is not the example to which I referred.  The example to which
I referred has not been commented upon as to its practical feasibility has
it?

However, is your example that difficult if someone set his or her mind to
it?  Consider for example that the software which does the unscrambling were
to have its own internal list of annotation facilitating characters so that
it assigned, for each page of the final rendered text, the characters in the
list of annotation facilitating characters in order for each U+FFF9 U+FFFA
pairing wherever the U+FFF9 item to be annotated were other than just one or
more U+FFFC characters.  The list of annotation facilitating characters
could be something like U+002A, U+2020, U+2021, U+2051, that is, asterisk,
dagger, double dagger, two asterisks aligned vertically.  The annotation
facilitating character is then placed both after the annotated item and
before the annotation, wherever that may be on the page, such as in a
footnote.  I am not suggesting that an algorithm for such is quickly
programmable, yet it does not seem on the face of it to be as unlikely to be
possible as your comment might perhaps seem to imply.
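
A minimal sketch of that footnote idea, assuming only the simple, unnested
case and the marker list given above (asterisk, dagger, double dagger, two
asterisks aligned vertically):

    // Rough illustration only: handles simple, unnested
    // U+FFF9 text U+FFFA note U+FFFB runs, replacing each with a marker
    // and collecting the notes for the foot of the page.
    import java.util.ArrayList;
    import java.util.List;

    public class FootnoteRenderer {
        private static final char[] MARKS = { '\u002A', '\u2020', '\u2021', '\u2051' };

        public static void main(String[] args) {
            String in = "Water boils at \uFFF9100 degrees\uFFFA"
                      + "at standard pressure\uFFFB Celsius.";
            StringBuilder body = new StringBuilder();
            List<String> notes = new ArrayList<String>();
            int i = 0;
            while (i < in.length()) {
                char c = in.charAt(i);
                if (c == '\uFFF9') {
                    int fffa = in.indexOf('\uFFFA', i);
                    int fffb = in.indexOf('\uFFFB', fffa);
                    char mark = MARKS[notes.size() % MARKS.length];
                    body.append(in, i + 1, fffa).append(mark);  // annotated item + marker
                    notes.add(mark + " " + in.substring(fffa + 1, fffb));
                    i = fffb + 1;
                } else {
                    body.append(c);
                    i++;
                }
            }
            System.out.println(body);
            for (String note : notes) System.out.println(note);
        }
    }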

That is what markup and rich text formats are for.

Well, maybe for your example, yet for my example a plain text file for the
main text together with a .uof file to state 

Re: Furigana

2002-08-16 Thread John Cowan

Tex Texin scripsit:

 At the time (in the discussion), I don't think we had many examples of
 what the uses would be, and it wan't clear that many were needed, since
 the functionality could be arrived at with higher level protocols.

One application that has always seemed obvious to me is regular expressions:
a compiled regular expression can be represented by a Unicode string,
with non-characters representing things like "any character", "zero or more",
"one or more", "beginning of string", "end of string", etc. etc.
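
A minimal sketch of such a compiled form, assuming an arbitrary private
choice of noncharacters for the operators (the particular code points mean
nothing outside this example):

    // Illustration only: an internal "compiled" regular expression kept as
    // a Unicode string, with arbitrarily chosen noncharacters standing for
    // the operators. Such a string is for internal use, never interchanged.
    public class CompiledRegexSketch {
        static final char ANY   = '\uFDD0';   // matches any single character
        static final char STAR  = '\uFDD1';   // zero or more of the preceding
        static final char BEGIN = '\uFDD2';   // beginning of string
        static final char END   = '\uFDD3';   // end of string

        // The internal form corresponding to the Posix-style pattern "abc.*\*".
        public static String compiledExample() {
            return "abc" + ANY + STAR + '*';  // trailing '*' is a literal asterisk
        }

        public static void main(String[] args) {
            String compiled = compiledExample();
            for (int i = 0; i < compiled.length(); i++) {
                char c = compiled.charAt(i);
                System.out.printf("U+%04X %s%n", (int) c,
                    c == ANY ? "ANY" : c == STAR ? "STAR" : String.valueOf(c));
            }
        }
    }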

-- 
John Cowan   [EMAIL PROTECTED]   http://www.ccil.org/~cowan
One time I called in to the central system and started working on a big
thick 'sed' and 'awk' heavy duty data bashing script.  One of the geologists
came by, looked over my shoulder and said 'Oh, that happens to me too.
Try hanging up and phoning in again.'  --Beverly Erlebacher




Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread William Overington

James Kass wrote as follows.

William Overington wrote,


 No, it is a story about an artist who wanted to paint a picture of a
horse
 and a picture of a dog and, since he knew that the horse and the dog were
 great friends and liked to be together and also that he only had one
canvas
 upon which to paint, the artist painted a picture of a landscape with the
 horse and the dog in the foreground, thereby, as the saying goes,
painting
 two birds on one canvas,
http://www.users.globalnet.co.uk/~ngo/bird0001.htm
 in that he achieved two results by one activity.  In addition the picture
 has various interesting details in the background, such as a windmill in
a
 plain (or is that a windmill in a plain text file).  :-)


1)  It's gif file format rather than plain text.*
2)  There isn't any windmill.

The picture of the birds has been in our family webspace since 1998 as an
illustration for the saying "Painting two birds on one canvas".  That
saying, originated by me, is a peaceful saying meaning to achieve two
results by one activity.  I made the picture from clip art as a learning
exercise.

The picture of the birds is referenced as a way of illustrating the saying
"Painting two birds on one canvas".  It is not the picture in the story
about which Ken asked.  I may well have a go at constructing such a picture,
perhaps using clip art.  The reference to a windmill is meant as a humorous
aside alluding to Don Quixote tilting at windmills.

I am interested in creative writing, so when Ken asked about the story, I
just thought of something to put in my response.  Part of the training in,
and the fun of, creative writing is to be able to write something promptly
to a topic.

William Overington

16 August 2002







Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread William Overington

Tex Texin wrote as follows.

William,

So let me see if I understand this correctly.

Let's take 2 perfectly good standards, Unicode and HTML,

Yes.

and make some
very minor tweaks to them,

No.

such as changing the meaning of U+FFFC and a
special format for filenames in the beginning of the file and a new
extension, so we have something new.

I have suggested no changes whatsoever to HTML at all.

The only thing which I have suggested in relation to Unicode in this thread
is that, given that information about the object to which any particular use
of U+FFFC refers is kept outside the character data stream, it could be a
good idea to define a file format .uof so that details of the names of the
files for which the U+FFFC codes are anchors could be provided in a known
format, if and only if end users chose to use a .uof file for that purpose
on that occasion and not otherwise.  This was in the context of seeking to
protect the use of U+FFFC as a character which could be used in the
interchange of documents, following from the discussion of U+FFFC and
annotation characters in the thread from which I spun off this thread, which
discussion, by Ken and Doug, is repeated in the first posting of this
present thread.

I thought it a good idea that the Unicode Technical Committee might like to
make such a .uof file format an official Unicode document so as to offer one
possible way to use U+FFFC codes.  That is now a matter for discussion.  If
the Unicode Consortium wishes to do that, then fine.  If the Unicode
Consortium chooses not to do that, then I can write it up myself and publish
it, which is not such a good solution, yet is adequate for my own needs and
might be useful for some other people if they choose to use the same format
for .uof files.

Hopefully I have now managed to raise the issue of protecting the fact that
the U+FFFC character can be used in document interchange and it will
hopefully not become deprecated to the status of a noncharacter.

There is a practical reason for this, which is, from my own perspective,
quite important.  This is as follows.

The DVB-MHP (Digital Video Broadcasting - Multimedia Home Platform) system
(details at http://www.mhp.org ) which implements my telesoftware invention.
A Java program which has been broadcast can read a Unicode plain text file
and act upon the characters within it, and can read other file formats, such
as .png files (Portable Network Graphics) and act upon the information in
those files, so as to produce a display.

So, a collection of files, namely a .uof file in the format that I suggested
it, a Unicode plain text file with one or more U+FFFC characters in it and
the appropriate graphics files in .png format as a package of free to the
end user distance education learning material being broadcast from a direct
broadcasting satellite or a terrestrial transmitter could be a very useful
facility as the way to carry text with illustrations.

Using HTML and a browser is just not the way to proceed in that situation.
HTML and a browser is a very useful technique for the web and indeed is an
option for the DVB-MHP system, yet the basic software system is Java based.
It is as if the television set is acting as a computer which has a slow read
only access disc drive in the sky from which it may gather information,
including software.  The system is interactive with no return information
link to the central broadcasting computer, by means of the telesoftware
invention.  Overlays and virtual running with programs bigger than the local
storage being able to be run using chaining techniques are possible.  Please
do not think of this as downloading as no uplink request is made!

Now the big benefit of this completely new thing,

Well, it's only a way of sender and receiver being able to have information
in a file with the suffix .uof about what objects are being anchored by
U+FFFC codes in a Unicode plain text file which it accompanies.

is that programs that
do desktop publishing can use plain text files which are not quite plain
text because they have some special formatting,

Well, the plain text files are only Unicode plain text which might contain
one or more U+FFFC characters and some of the other Unicode control
characters such as CARRIAGE RETURN.

but now they can publish
them in better manner than before.

Well, my thinking is that it would help to have a well known way to express
the meaning of the anchors encoded by U+FFFC in a file rather than having
only a vague specification that all other information about the object is
kept outside the data stream.  I am saying that, yes, all other information
about the object is kept outside the data stream and, if, and only if, end
users choose to use a .uof file in a standard format to convey that
information for some particular use of a U+FFFC code, then that format could
be considered for definition and publication by the Unicode Consortium.
That does not seem unreasonable to me.  

Re: Furigana

2002-08-16 Thread Tex Texin

John,
Why would you want them to be for internal-use only? When you exchange
regular expressions wouldn't you want operators such as "any character"
to be passed as well, and standardized so that there is agreement on the
meaning of the expression?

It is also not clear to me that it is desirable to encode operators of
regular expressions as individual characters, because then you get into
the slippery slope of encoding operators for every function that someone
might want, and that is what started this thread isn't it...
(But a Unicode APL operator set would be nice. ;-) )

tex

John Cowan wrote:
 
 Tex Texin scripsit:
 
  At the time (in the discussion), I don't think we had many examples of
  what the uses would be, and it wan't clear that many were needed, since
  the functionality could be arrived at with higher level protocols.
 
 One application that has always seemed obvious to me is regular expressions:
 a compiled regular expression can be represented by a Unicode string,
 with non-characters representing things like any character, zero or more,
 one or more, beginning of string, end of string, etc. etc.
 
 --
 John Cowan   [EMAIL PROTECTED]   http://www.ccil.org/~cowan
 One time I called in to the central system and started working on a big
 thick 'sed' and 'awk' heavy duty data bashing script.  One of the geologists
 came by, looked over my shoulder and said 'Oh, that happens to me too.
 Try hanging up and phoning in again.'  --Beverly Erlebacher

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCrafthttp://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Furigana

2002-08-16 Thread John Cowan

Tex Texin scripsit:

 Why would you want them to be for internal-use only? When you exchange
 regular expressions wouldn't you want operators such as any character
 to be passed as well, and standardized so that there is agreement on the
 meaning of the expression?

Regular expressions are usually interchanged using (some approximation of)
Posix syntax, so as "abc.*\*", not "abc<ANY><STAR>*".  Note the phrase
"compiled form" in my posting.

 It is also not clear to me that it is desirable to encode operators of
 regular expressions as individual characters, because then you get into
 the slippery slope of encoding operators for every function that someone
 might want, and that is what started this thread isn't it...

Ah, but for internal use you can do what you want with the 66 non-characters
and the 4 pseudo-non-characters.

 (But a Unicode APL operator set would be nice. ;-) )

Um, we have one of those, don't we?

-- 
John Cowan
[EMAIL PROTECTED]
I am a member of a civilization. --David Brin




Re: Furigana

2002-08-16 Thread Tex Texin



John Cowan wrote:
 
 Tex Texin scripsit:
 
  Why would you want them to be for internal-use only? When you exchange
  regular expressions wouldn't you want operators such as any character
  to be passed as well, and standardized so that there is agreement on the
  meaning of the expression?
 
 Regular expressions are usually interchanged using (some approximation of)
 Posix syntax, so as "abc.*\*", not "abc<ANY><STAR>*".  Note the phrase
 "compiled form" in my posting.

Seems like a very minor optimization then. (I am not saying undesirable,
just it is a small benefit.)
 
  It is also not clear to me that it is desirable to encode operators of
  regular expressions as individual characters, because then you get into
  the slippery slope of encoding operators for every function that someone
  might want, and that is what started this thread isn't it...
 
 Ah, but for internal use you can do what you want with the 66 non-characters
 and the 4 pseudo-non-characters.

Yes. Same thing is true for higher level protocols.
 
  (But a Unicode APL operator set would be nice. ;-) )
 
 Um, we have one of those, don't we?

Sorry, I was unclear. I meant this in the context of encoding a set of
APL-like operators for working on Unicode text to manipulate them in
regular expressions, going way beyond the "any character", "0 or more"
character operators.

tex

 
 --
 John Cowan
 [EMAIL PROTECTED]
 I am a member of a civilization. --David Brin

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCrafthttp://www.XenCraft.com
Making e-Business Work Around the World
-




Re: The existing rules for U+FFF9 through to U+FFFC. (spins from Re: Furigana)

2002-08-16 Thread Peter_Constable

On 08/15/2002 06:41:59 AM William Overington wrote:

In essence, though not formally, U+FFF9..U+FFFC are non-characters as
well, and the Unicode semantics just tells what programs *may* find them
useful for.  Unicode 4.0 editors: it might be a good idea to emphasize
the close relationship of this small repertoire with the non-characters.

That is not what the specification says.

William, John knows what he is talking about, and is exactly correct: in
essence, though not formally, FFF9..FFFC are non-characters. No, the
Standard doesn't say that; that's why he said "not formally". The use
intended by the Standard is, however, exactly comparable to the
non-characters at FDD0..FDEF. If they had been defined in the Standard as
non-characters, the world would not be different in any meaningful way.



It appears to me that the use of the annotation characters in document
interchange is never forbidden and is strongly discouraged only where there
is no prior agreement between the sender and the receiver, and that that
strong discouragement is because the content may be misinterpreted
otherwise.  So, if there is a prior agreement, then there is no problem
about using them in interchanged documents.

There appears to be nothing that suggests that U+FFFC cannot be used in an
interchanged document.

Well, you've missed the intent of the authors of the Standard, and appear 
not to grasp the mindset. When it says interchange of IA characters may be 
OK given prior agreement, what's really in mind is that e.g. I've written 
code library A that handles some aspects of interlinear annotation, you've 
written code library B that handles different aspects of interlinear 
annotation, and we agree on certain interfaces so that my library can call 
yours or vice versa, and agree that strings passed by those interfaces can 
contain IA characters. That's the kind of thing that's in mind. It does 
*not* imply that anyone should consider creating a document containing IA
characters. 



I know little about Bliss symbols, though I have seen a few of them and have
read a brief introduction to them, yet it seems to me that annotating Bliss
symbols with English or Swedish is entirely within the specification
absolutely and would be no more than strongly discouraged even if there is
no prior agreement between the sender and the receiver.

Of course the Standard doesn't discourage anyone from annotating Bliss 
symbols with English or Swedish; it only discourages the use of IA 
characters as markup in documents.



Further, it seems to me from the published rules that these annotation
characters could possibly be used to provide a footnote annotation facility
within a plain text file

That would not be a proposal worth pursuing; in fact, I'd say it's a very 
bad idea. The reason you DO NOT want to use IA characters in a document is 
that you do not know what someone's software will do with them. The 
characters have always been intended for use by software programmers, not 
by content authors. (Ditto for the object replacement character.)



An interesting point for consideration is as to whether the following
sequence is permitted in interchanged documents...

It seems to me that if that is indeed permissible that it could potentially
be a useful facility.

On the whole, it would be very unwise to use these characters in documents 
for reasons I explained above. If two people agree to do this, nobody's 
going to send the Unicode police to stop them. But very few of us on this 
list are particularly interested in what is hypothetically possible for 
some pair of us to do. We're far more interested in how widely-used 
implementations should and do work, and in such implementations, 
FFF9..FFFC are assumed not to be used in content.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-15 Thread Roozbeh Pournader

On Wed, 14 Aug 2002, James Kass wrote:

 One, the use of *.html clearly violates the standard file naming
 convention of eight uppercase ASCII letters followed by a period
 followed by a *three* letter uppercase ASCII file name extension.

I was wondering if the capitalization, ASCII, is for emphasis... ;)

roozbeh





Re: The existing rules for U+FFF9 through to U+FFFC. (spins from Re: Furigana)

2002-08-15 Thread Kenneth Whistler

 An interesting point for consideration is as to whether the following
 sequence is permitted in interchanged documents.
 
 U+FFF9 U+FFFC U+FFFA Temperature variation with time. U+FFFB
 
 That is, the annotated text is an object replacement character and the
 annotation is a caption for a graphic.

Yes, permitted. As would also be:

U+FFF9 U+FFFC U+FFFC U+FFFA U+FFF9 Temperature U+FFFA a measure of
hotness, related to the U+FFF9 kinetic energy U+FFFA energy of motion U+FFFB
of molecules of a substance U+FFFB U+FFF9 variation U+FFFA rate of change 
U+FFFB with time U+FFFC . U+FFFB

Where the first U+FFFC is associated with a URL with a realtime data feed,
the second U+FFFC is a jar file for a 3-dimensional dynamic display algorithm,
and the third U+FFFC is a banner ad for Swatch watches.

 It seems to me that if that is indeed permissible that it could potentially
 be a useful facility.

Permissible does not imply useful, however, in this case. It is
unlikely that you are going to have access to software that would
unscramble such layering in purported plain text, even if you
had agreements with your receivers. That is what markup and rich
text formats are for.

Note that it is also *permissible* in Unicode to spell "permissible"
as "purrmisuhbal". That doesn't mean that it would be a good idea
to do so, but the standard does not preclude you from doing so.
You could even write a rendering algorithm which would display the
sequence of Unicode characters p,u,r,r,m,i,s,u,h,b,a,l with the glyphs
{permissible} if you so choose.

--Ken






Re: Furigana

2002-08-14 Thread Doug Ewell

Tex Texin tex at i18nguy dot com wrote:

 http://www.unicode.org/unicode/uni2book/ch13.pdf

 As I read that material, I take it to be saying that senders should
 remove the I.A. characters.

What if I *want* to design an annotation-aware rendering mechanism?
Suppose I read Section 13.6 and decide that, instead of just throwing
the annotation characters away, I should attempt to display them
directly above (and smaller than) the normal text, the way furigana
are displayed above kanji.
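
A minimal sketch of the splitting step such a rendering mechanism would
need, assuming only the simplest unnested case; a renderer could then draw
each base/annotation pair on two lines, the annotation smaller and above:

    // Sketch for the simplest case only: split U+FFF9 base U+FFFA note
    // U+FFFB runs into pairs a renderer could draw as base text with the
    // annotation above it.
    public class RubyPairs {
        public static void main(String[] args) {
            String in = "\uFFF9Tokyo\uFFFAtoukyou\uFFFB is the capital.";
            int i = 0;
            while (i < in.length()) {
                int anchor = in.indexOf('\uFFF9', i);
                if (anchor < 0) {
                    System.out.println("plain: " + in.substring(i));
                    break;
                }
                if (anchor > i) System.out.println("plain: " + in.substring(i, anchor));
                int sep = in.indexOf('\uFFFA', anchor);
                int term = in.indexOf('\uFFFB', sep);
                System.out.println("base:  " + in.substring(anchor + 1, sep));
                System.out.println("above: " + in.substring(sep + 1, term));
                i = term + 1;
            }
        }
    }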

This would work not only for typical Japanese ruby, but also for
Michael's English-or-Swedish-over-Bliss scenario.  It might even be
useful in assisting beleaguered Azerbaijanis, for example, by annotating
Latin-script text with its Cyrillic equivalent.  (Just a thought.)

Would this be conformant?

-Doug Ewell
 Fullerton, California





Re: Furigana

2002-08-14 Thread Tex Texin

The text says: except for private agreement.
So if con-senting a-d-u-l-t-s want to exchange interlinear annotated
text, that is fine.
(I hyphenated the words because some of my previous emails were rejected
by Doug's filters..)

tex

Doug Ewell wrote:
 
 Tex Texin tex at i18nguy dot com wrote:
 
  http://www.unicode.org/unicode/uni2book/ch13.pdf
 
  As I read that material, I take it to be saying that senders should
  remove the I.A. characters.
 
 What if I *want* to design an annotation-aware rendering mechanism?
 Suppose I read Section 13.6 and decide that, instead of just throwing
 the annotation characters away, I should attempt to display them
 directly above (and smaller than) the normal text, the way furigana
 are displayed above kanji.
 
 This would work not only for typical Japanese ruby, but also for
 Michael's English-or-Swedish-over-Bliss scenario.  It might even be
 useful in assisting beleaguered Azerbaijanis, for example, by annotating
 Latin-script text with its Cyrillic equivalent.  (Just a thought.)
 
 Would this be conformant?
 
 -Doug Ewell
  Fullerton, California

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCrafthttp://www.XenCraft.com
Making e-Business Work Around the World
-




RE: Furigana

2002-08-14 Thread Michael Everson

At 16:35 -0700 2002-08-13, Murray Sargent wrote:
Michael Everson said "Well then they [interlinear annotation characters]
oughtn't to have been encoded."

Michael, you aren't an implementer.

I'm not the kind of implementor you are. I do implement things. :-)

When you implement things unambiguously, you may need internal code 
points in your plain-text stream to attach higher-level protocols 
(such as formatting properties) to. Such internal code points should 
not be exported or imported.

Excuse me, this makes no sense whatsoever. If your company, for 
instance, needed INTERNAL code points to attach to higher level 
protocols, why did you not use the Private Use Area? Have I got this 
wrong? You're saying your company did want to use them but wanted 
them in the non-PUA BMP so they could -- am I getting it right -- be 
INTERCHANGED. OK, that's fine, but is it the case that these are ONLY 
allowed to be used by your company?

From your point of view perhaps, they shouldn't have been encoded. But from
an implementation point of view, they're very handy. Unicode needs to
serve both purposes. For what use would Unicode be if you couldn't
implement it effectively?

I'm saying I WANT to use these characters. They solve an apparent 
need of mine -- they would be very handy indeed, as I said in the 
Beijing meeting where they were discussed. I am mystified as to why 
people are telling me that I shouldn't because lots of applications 
may strip them out.

I am deeply confused.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: RE: Furigana

2002-08-14 Thread Michael Everson

At 17:59 -0700 2002-08-13, Kenneth Whistler wrote:

And Microsoft has others of such beasties hiding internally as
anchors for you-don't-wanna-know-what -- also not interchanged.

I am ***NOT*** bashing MS here, but what is everyone saying? That 
these characters should be annotated in the Unicode Standard as for 
Microsoft's use only?

Or is it to be for the use of anyone except Microsoft who does 
something else?

Or what!?
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Furigana

2002-08-14 Thread John Cowan

James Kass scripsit:

 Once a meaning like
 INTERLINEAR ANNOTATION ANCHOR has been assigned to
 a code point, any application which chooses to use that code
 point for any other purpose would be at fault.

But a purely nominal one, since any use of these three codepoints
should be behind the firewall of the application.

 I understand that having common internal use code points might
 be considered handy from an implementer's point of view, but
 suggest that such conventions should be shared among implementers
 only, and should not be enshrined in a character encoding standard.

I doubt you will see any more such things.  BTW, note that FFFC has
the same internal-only property.

 Because it seems to be an oxymoron.  If it has a specific semantic
 meaning, then it should be possible to store and exchange it
 without any loss of meaning.  

For what seemed to them good and sufficient reasons, the UTC did
not do this: they allocated the points but proscribed them from
use in interchange.  Had they thought of the permanent non-character
block at the time, they probably would not have done this.

-- 
John Cowanhttp://www.ccil.org/~cowan  [EMAIL PROTECTED]
Please leave your values|   Check your assumptions.  In fact,
   at the front desk.   |  check your assumptions at the door.
 --sign in Paris hotel  |--Miles Vorkosigan




Re: Furigana

2002-08-14 Thread John Cowan

Michael Everson scripsit:

 Excuse me, this makes no sense whatsoever. If your company, for 
 instance, needed INTERNAL code points to attach to higher level 
 protocols, why did you not use the Private Use Area?  

Well, suppose I wanted to use a codepoint internally to a program for
some purpose or other -- for example, to indicate the point at which
a graphic was to be inserted in the final HTML output.  If I allocated
U+E000 to that purpose, then that program could not be used to process
CSUR Tengwar text.  Thus it is useful to have non-character codepoints,
which are not meant to be interchanged, as well as PUA codepoints,
which are meant to be interchanged under private agreements.

In essence, though not formally, U+FFF9..U+FFFC are non-characters as
well, and the Unicode semantics just tells what programs *may* find them
useful for.  Unicode 4.0 editors: it might be a good idea to emphasize
the close relationship of this small repertoire with the non-characters.

-- 
John Cowan  http://www.ccil.org/~cowan[EMAIL PROTECTED]
To say that Bilbo's breath was taken away is no description at all.  There are
no words left to express his staggerment, since Men changed the language that
they learned of elves in the days when all the world was wonderful. --The Hobbit




Re: Furigana

2002-08-14 Thread Michael Everson

At 20:09 -0700 2002-08-12, Doug Ewell wrote:

Everybody will welcome the new conventional, graphical-type 
characters and scripts that are coming with Unicode 4.0.  But maybe 
before standardizing another COMBINING GRAPHEME JOINER or other 
control-type character, it would be prudent to study the angles even 
more thoroughly and carefully, and make *damn* sure the character is 
going to be usable and not discouraged or even deprecated at birth.

I tried to get some proper discussion papers on the CGJ from its 
advocates but they never really appeared. I still have misgivings 
about this beastly thing.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




RE: Furigana

2002-08-14 Thread Marco Cimarosti

Doug Ewell wrote:
 I'll have to check with Adelphia and see who or what is trying to
 protect me from myself.

Those automatic b*llsh*ts!

A few years ago I was temporarily assigned to the central national office of
my previous employer. It was when the Unicode list was discussing something
about the *Mongolian* script: each time I tried to access some website
having information about the script, the server replied "Access to sites
advocating abortion is banned on this system!"

_ Marco




Re: Furigana

2002-08-14 Thread Kenneth Whistler

Doug (and Michael also):

 What if I *want* to design an annotation-aware rendering mechanism?
 Suppose I read Section 13.6 and decide that, instead of just throwing
 the annotation characters away, I should attempt to display them
 directly above (and smaller than) the normal text, the way furigana
 are displayed above kanji.
 
 This would work not only for typical Japanese ruby, but also for
 Michael's English-or-Swedish-over-Bliss scenario.  It might even be
 useful in assisting beleaguered Azerbaijanis, for example, by annotating
 Latin-script text with its Cyrillic equivalent.  (Just a thought.)
 
 Would this be conformant?

Well, technically conformant, but not wise. If commonly available
display and rendering mechanisms are not rendering them as interlinear
annotations, then you aren't really providing much assistance here
by using a mechanism designed for internal anchors and trying to
turn it into something it isn't really up to snuff for.

Frankly, you would be much better off making use of the Ruby annotation
schemes available in markup languages, which will give you better
scoping and attribute mechanisms.

Stop worrying a moment about "Why are these characters standardized,
and why the hedoublehockeysticks can't I use them?!" and think about
the problem that furigana or any other interlinear annotation rendering
system has to address:

  a. How are the annotations adjusted? Left-adjusted, centered,
 something else? And what point(s) are they synched on?

  b. If the annotated text or the annotation itself consist of
 multiple units, are there subalignments? E.g.

   note note note  note
   text text textextextext text

or

   note note  note note
   text text textextextext text

  c. Can an annotation itself be stacked into a multiline form?

   note note note
 nononononote
   text

  d. Can the text of the annotation itself in turn be annotated?

  e. Can the text have two or more coequal annotations? And if so,
 how are they aligned?

  f. If the annotation is in a distinct style from the text it
 annotates, how is that indicated and controlled?

  g. How is line-break controlled on a line which also has an
 annotation?

And so on. This is all the kind of stuff that clearly smacks to me
of document formatting concerns and rich text. Why anyone would consider
such things to be plain text rather escapes me.

--Ken




Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-14 Thread Kenneth Whistler

William Overington teased us all unmercifully with:

 It occurs to me that it is possible to introduce a convention, either as a
 matter included in the Unicode specification, or as just a known about
 thing, that if one has a plain text Unicode file with a file name that has
 some particular extension (any ideas for something like .uof for Unicode
 object file) 

...or to pick an extension, more or less at random, say .html

 that accompanies another plain text Unicode file which has a
 file name extension such as .txt, or indeed other choices except .uof (or
 whatever is chosen after discussion) then the convention could be that the
 .uof file has on lines of text, in order, the name of the text file then the
 names of the files which contains each object to which a U+FFFC character
 provides the anchor.
 
 For example, a file with a name such as story7.uof might have the following
 lines of text as its contents.
 
 story7.txt
 horse.gif
 dog.gif
 painting.jpg

This is a shaggy dog story, right?

 
 The file story7.uof could thus be used with a file named story.txt so as to
 indicate which objects were intended to be used for three uses of U+FFFC in
 the file story7.txt, in the order in which they are to be used.
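Purely as a sketch of one way a receiver might interpret the convention quoted
above (the file names follow the example, while the bracketed placeholder
output and the absence of error handling are illustrative assumptions only):

  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.List;

  // Sketch: read story7.uof, then replace each U+FFFC in the named text file
  // with a reference to the corresponding object file, in order.
  public class UofReader {
      public static void main(String[] args) throws Exception {
          List<String> lines =
              Files.readAllLines(Path.of("story7.uof"), StandardCharsets.UTF_8);
          String text = Files.readString(Path.of(lines.get(0)));
          String[] parts = text.split("\uFFFC", -1);  // one U+FFFC per object
          StringBuilder out = new StringBuilder(parts[0]);
          for (int i = 1; i < parts.length && i < lines.size(); i++) {
              // lines.get(i) names the object for the i-th U+FFFC anchor
              out.append("[object: ").append(lines.get(i)).append("]")
                 .append(parts[i]);
          }
          System.out.println(out);
      }
  }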

Or we could go even further, and specify that in the story7.html file,
the three uses of those objects could be introduced with a very specific
syntax that would not only indicate the order that they occur in, but
could indicate the *exact* location one could obtain the objects -- either on 
one's own machine or even anywhere around the world via the Internet! And we could 
even include a mechanism for specifying the exact size that the object should be
displayed. For example, we could use something like:

<img src="http://www.coteindustries.com/dogs/images/dogs4.jpg" width=380
 height=260 border=1>

or

<img src="http://www.artofeurope.com/velasquez/vel2.jpg">

 I can imagine that such a widely used practice might be helpful in bridging
 the gap between being able to use a plain text file or maybe having to use
 some expensive wordprocessing package.

And maybe someone will write cheaper software -- we could call it a browser --
that could even be distributed for free, so that people could make use of
this convention for viewing objects correctly distributed with respect to
the text they are embedded in.

Yes, yes, I think this is an idea which could fly.

--Ken






Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-14 Thread James Kass


Kenneth Whistler wrote in response to William Overington,

 
 ...or to pick an extension, more or less at random, say .html
 

  The file story7.uof could thus be used with a file named story.txt so as to
  indicate which objects were intended to be used for three uses of U+FFFC in
  the file story7.txt, in the order in which they are to be used.
 
 Or we could go even further, and specify that in the story7.html file,
 the three uses of those objects could be introduced with a very specific
 syntax that would not only indicate the order that they occur in, but
 could indicate the *exact* location one could obtain the objects -- either on 
 one's own machine or even anywhere around the world via the Internet! And we could 
 even include a mechanism for specifying the exact size that the object should be
 displayed. For example, we could use something like:
 
 <img src="http://www.coteindustries.com/dogs/images/dogs4.jpg" width=380
  height=260 border=1>
 
 And maybe someone will write cheaper software -- we could call it a browser --
 that could even be distributed for free, so that people could make use of
 this convention for viewing objects correctly distributed with respect to
 the text they are embedded in.
 
 Yes, yes, I think this is an idea which could fly.
 

Well, there might be some serious objections to such a proposal.

One, the use of *.html clearly violates the standard file naming
convention of eight uppercase ASCII letters followed by a period
followed by a *three* letter uppercase ASCII file name extension.

Secondly, the use of the greater-than and less-than ASCII characters
to denote the mark-up sure appears to be a misuse of those 
characters.  This may well cause too much confusion in parsing.

3<superscript>rd</superscript>, the cost of development of these 
hypothetical browsers would be quite high, and we couldn't really 
expect any such expensive software to be literally given away.  
There would have to be some catch to it all, wouldn't there?

Best regards,

James Kass,
(P.S. - The point of this response is that maybe we shouldn't 
hastily reject new concepts just because they seem to fly
in the face of existing practices. - JK)






The existing rules for U+FFF9 through to U+FFFC. (spins from Re: Furigana)

2002-08-14 Thread William Overington

John Cowan wrote as follows.

In essence, though not formally, U+FFF9..U+FFFC are non-characters as
well, and the Unicode semantics just tells what programs *may* find them
useful for.  Unicode 4.0 editors: it might be a good idea to emphasize
the close relationship of this small repertoire with the non-characters.

That is not what the specification says.  Something can only be emphasised
if it is true in the first place!  If it is desired to make U+FFF9 through
to U+FFFC noncharacters then that needs to be done explicitly with a fair
opportunity for people to object and make representations before a decision
is made.

A saying of my own is as follows.

When goalposts are moved, aromatic herbs should be scattered around.

It seems to me, not having known about the annotation characters previously,
yet now, due to this thread, having read the published rules in Chapter 13,
that these are not noncharacters.

It appears to me that the use of the annotation characters in document
interchange is never forbidden and is strongly discouraged only where there
is no prior agreement between the sender and the receiver, and that that
strong discouragement is because the content may be misinterpreted
otherwise.  So, if there is a prior agreement, then there is no problem
about using them in interchanged documents.

There appears to be nothing that suggests that U+FFFC cannot be used in an
interchanged document.

I know little about Bliss symbols, though I have seen a few of them and have
read a brief introduction to them, yet it seems to me that annotating Bliss
symbols with English or Swedish is entirely within the specification and
would be no more than strongly discouraged even if there is no prior
agreement between the sender and the receiver.

Further, it seems to me from the published rules that these annotation
characters could possibly be used to provide a footnote annotation facility
within a plain text file, so that, if a plain text file is being printed out
in book format, a footnote about a word or phrase could be encoded using
this technique and the rendering software could place the footnote on the
same page as the word or phrase being annotated, regardless of whether that
word or phrase occurs near the start, middle or end of that page.  It seems
to me that the statement of the meaning of U+FFFA implies that the diagrams
in Figure 13-3 of the specification are just examples, though as the word
"exact" is used, perhaps they are guiding examples and the use in footnotes
is perhaps stretching the variation from the examples in the diagram.

An interesting point for consideration is as to whether the following
sequence is permitted in interchanged documents.

U+FFF9 U+FFFC U+FFFA Temperature variation with time. U+FFFB

That is, the annotated text is an object replacement character and the
annotation is a caption for a graphic.

It seems to me that if that is indeed permissible that it could potentially
be a useful facility.
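As a sketch only, that sequence could be built and examined like this in Java;
whether interchanging it is wise is exactly what the rest of this thread
disputes, and the code assumes the prior agreement discussed above.

  // Sketch: the annotated text is a single U+FFFC, the annotation a caption.
  public class CaptionedObject {
      public static void main(String[] args) {
          String seq = "\uFFF9" + "\uFFFC" + "\uFFFA"
                     + "Temperature variation with time." + "\uFFFB";

          int fffa = seq.indexOf('\uFFFA');
          int fffb = seq.indexOf('\uFFFB');
          String annotatedText = seq.substring(1, fffa);   // here: U+FFFC only
          String caption = seq.substring(fffa + 1, fffb);

          System.out.println("annotated text is an object anchor: "
                  + annotatedText.equals("\uFFFC"));
          System.out.println("caption: " + caption);
      }
  }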

On balance, it seems to me that if both sender and receiver are clear as to
what is meant, then the use of annotation characters for Bliss symbols, for
footnotes and for captions for illustrations harms no one.  A person skilled
in the art who seeks to use the file without knowledge of the interpretation
agreement which should ideally exist between sender and receiver, and who
has only the Unicode specification to go on, would probably be unlikely to
get a wrong interpretation of the intended meaning, even if the actual
graphical layout were imprecise, as the Unicode standard locks together the
two parts of the annotation sequence and shows that one of the parts is the
annotation for the other.

William Overington

15 August 2002











Re: Furigana

2002-08-13 Thread Michael Everson

At 12:11 -0700 2002-08-08, Kenneth Whistler wrote:

Ah, but read the caveats carefully. The Unicode interlinear
annotation characters are *not* intended for interchange, unlike
the HTML4 ruby tag. See TUS 3.0, p. 326. They are, essentially,
internal-use anchor points.

What does this mean? That if I have a text all nice and marked up 
with furigana in Quark I can't export it to Word and reimport it in 
InDesign and expect my nice marked up text to still be marked up?

Surely all Unicode/10646 characters are expected to be preserved in 
interchange. What have I got wrong, Ken?
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Furigana

2002-08-13 Thread Michael Everson

At 19:59 +0900 2002-08-08, Dan Kogai wrote:
On Thursday, August 8, 2002, at 04:17 , Michael Everson wrote:
Where do I start looking for information about implementing 
furigana? Can you have more than one gloss attached to a word? We 
are considering implementing this for Blissymbols.

What do you mean by implementing?  Or to what extent do you want 
furigana implemented?

I want to be able to send a Blissymbol string with a gloss in English 
or Swedish attached. Nothing to do with Japanese whatsoever.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




RE: Furigana

2002-08-13 Thread Murray Sargent

As Ken says, the Unicode interlinear annotation characters are for
internal use only. Specifically, their meanings can be different for
different programs. If you have your nice marked up text in memory and
want to export it for use by some program, you need to use a
higher-level protocol that translates the interlinear annotation
characters to a standardized external format, such as HTML. In addition
to U+FFF9 - U+FFFB, there are other characters for internal use only,
namely U+FDD0 - U+FDEF. The meanings of these characters also can (and
do) differ for different programs. Originally it was hoped that the
interlinear annotation characters might be able to describe ruby
adequately, but it became clear that additional information is necessary
to express ruby unambiguously. Hence the UTC adopted them for internal
use only, with associated information presumably stored elsewhere to
resolve the ambiguities.
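A minimal sketch of that kind of translation, in Java (not RichEdit's or any
other product's actual export code, and handling only a single well-formed
base/annotation pair):

  // Sketch: translate U+FFF9..U+FFFB into HTML ruby markup on export.
  public class RubyExport {
      static String toRuby(String s) {
          return s.replace("\uFFF9", "<ruby><rb>")
                  .replace("\uFFFA", "</rb><rt>")
                  .replace("\uFFFB", "</rt></ruby>");
      }

      public static void main(String[] args) {
          String internal = "\uFFF9\u6F22\u5B57\uFFFA\u3075\u308A\u304C\u306A\uFFFB";
          System.out.println(toRuby(internal));
          // prints: <ruby><rb>漢字</rb><rt>ふりがな</rt></ruby>
      }
  }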

Frankly, IMHO the best thing for a program to do when reading such
characters is to delete them. This isn't quite what one might think from
the Standard since they unfortunately aren't labeled as noncharacters.
But if a program uses them internally with a well defined meaning,
getting them in from an external source can violate the internal usage.
To actually roundtrip these rogue characters would require some extra
internal protocol to ignore them when they've been read in. So my edit
engine (RichEdit), which uses them for table row delimiters, simply
deletes them on input and only exports them for RichEdit-specific
contexts.

Murray

-Original Message-
From: Michael Everson [mailto:[EMAIL PROTECTED]] 
Sent: Tuesday, August 13, 2002 7:52 AM
To: [EMAIL PROTECTED]
Cc: Ken Whistler
Subject: Re: Furigana


At 12:11 -0700 2002-08-08, Kenneth Whistler wrote:

Ah, but read the caveats carefully. The Unicode interlinear annotation 
characters are *not* intended for interchange, unlike the HTML4 ruby 
tag. See TUS 3.0, p. 326. They are, essentially, internal-use anchor 
points.

What does this mean? That if I have a text all nice and marked up 
with furigana in Quark I can't export it to Word and reimport it in 
InDesign and expect my nice marked up text to still be marked up?

Surely all Unicode/10646 characters are expected to be preserved in 
interchange. What have I got wrong, Ken?
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com





Re: Furigana

2002-08-13 Thread Kenneth Whistler


 I want to be able to send a Blissymbol string with a gloss in English 
 or Swedish attached. Nothing to do with Japanese whatsoever.

Basically, as for all things annotational or interlineating, this
is an excellent application for markup.

--Ken





Re: Furigana

2002-08-13 Thread Philipp Reichmuth

Hi Michael,

ME I want to be able to send a Blissymbol string with a gloss in
ME English or Swedish attached.

Do you need this in plain text? If I understand Blissymbols correctly,
this is just to give an explanation of the Blissymbol string, much
like giving the Pinyin pronunciation to a Han ideograph or giving IPA
for a native orthography in linguistics textbooks.

  Philipp    mailto:[EMAIL PROTECTED]





Re: Furigana

2002-08-13 Thread Michael Everson

At 14:16 -0700 2002-08-13, Kenneth Whistler wrote:
   I want to be able to send a Blissymbol string with a gloss in English
  or Swedish attached. Nothing to do with Japanese whatsoever.

Basically, as for all things annotational or interlineating, this
is an excellent application for markup.

When this was discussed in WG2 in Japan before they went in, I asked 
specifically, could I use this method to put Anglo-Saxon glosses on 
Latin text. The answer was positive, so it received my support. Were 
these always pre-deprecated? Why are they in the standard if no one 
is going to be allowed to use them?
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Furigana

2002-08-13 Thread Michael Everson

At 23:50 +0200 2002-08-13, Philipp Reichmuth wrote:
Hi Michael,

ME I want to be able to send a Blissymbol string with a gloss in
ME English or Swedish attached.

Do you need this in plain text?

We are exploring what to do.

If I understand Blissymbols correctly,
this is just to give an explanation of the Blissymbol string, much
like giving the Pinyin pronunciation to a Han ideograph or giving IPA
for a native orthography in linguistics textbooks.

But Blissymbols are most often transmitted (in gifs for instance) 
with glosses which help people not literate in Blissymbols but able 
to read other languages to understand what is being said.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Furigana

2002-08-13 Thread Kenneth Whistler

Michael,

 At 14:16 -0700 2002-08-13, Kenneth Whistler wrote:
I want to be able to send a Blissymbol string with a gloss in English
   or Swedish attached. Nothing to do with Japanese whatsoever.
 
 Basically, as for all things annotational or interlineating, this
 is an excellent application for markup.
 
 When this was discussed in WG2 in Japan before they went in, I asked 
 specifically, could I use this method to put Anglo-Saxon glosses on 
 Latin text. The answer was positive, so it received my support. Were 
 these always pre-deprecated? Why are they in the standard if no one 
 is going to be allowed to use them?

Read the discussion which has been published in the Unicode Standard
ever since these things were available. TUS 3.0, pp. 325 - 326.

The annotation characters are used in internal processing when
   ^^^
 out-of-band information is associated with a character stream, very
 similarly to the usage of the U+FFFC OBJECT REPLACEMENT CHARACTER...

Usage of the annotation characters in plain text interchange is
 strongly discouraged without prior agreement between the sender
 
 and receiver because the content may be misinterpreted otherwise...


When an output for plain text usage is desired and when the receiver
^
 is unknown to the sender, these interlinear annotation characters
 
 should be removed...
 ^

The Japanese national body was very clear about this, and was opposed
to these going into the standard unless such clarifications were made,
to ensure that these were not intended for plain text interchange
of furigana (or other similar annotations).

--Ken





Re: Furigana

2002-08-13 Thread Michael Everson

At 16:00 -0700 2002-08-13, Kenneth Whistler wrote

The Japanese national body was very clear about this, and was opposed
to these going into the standard unless such clarifications were made,
to ensure that these were not intended for plain text interchange
of furigana (or other similar annotations).

Well then they oughtn't to have been encoded.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Furigana

2002-08-13 Thread Kenneth Whistler

Michael Everson (in training as a curmudgeon) harrumpfed ;-)

 The Japanese national body was very clear about this, and was opposed
 to these going into the standard unless such clarifications were made,
 to ensure that these were not intended for plain text interchange
 of furigana (or other similar annotations).
 
 Well then they oughtn't to have been encoded.

Yes, we agree that hindsight is a wonderful skill. This function
would better be served by noncharacter code points, but nobody
had quite figured out how to articulate that yet.

But even at the time, as the record of the deliberations would
show, if we had a more perfect record, the proponents were clear
that the interlinear annotation characters were to solve an
internal anchor point representation problem. Nobody (well, maybe
somebody) expected them to serve as a substitute for a general
markup mechanism for indication of annotation, and in particular,
interlinear annotations. I recall at the time I pointed out that
as a linguist I had routinely made use of 4-line interlinear
annotation formats, and that this simple anchoring scheme couldn't
even begin to represent such complexities in a usable fashion.

--Ken




RE: Furigana

2002-08-13 Thread Murray Sargent

Michael Everson said Well then they [interlinear annotation characters]
oughtn't to have been encoded.

Michael, you aren't an implementer. When you implement things
unambiguously, you may need internal code points in your plain-text
stream to attach higher-level protocols (such as formatting properties)
to. Such internal code points should not be exported or imported. From
your point of view perhaps, they shouldn't have been encoded. But from
an implementation point of view, they're very handy. Unicode needs to
serve both purposes. For what use would Unicode be if you couldn't
implement it effectively? 

Murray




Re: RE: Furigana

2002-08-13 Thread starner

Michael, you aren't an implementer. When you implement things
unambiguously, you may need internal code points in your plain-text
stream to attach higher-level protocols (such as formatting properties)
to. 

That seems to be basically what William Overington is proposing,
except these characters only handle furigana, instead of all markup.




Re: Furigana

2002-08-13 Thread Tex Texin

Murray,

It's true implementers need  some place to attach higher level
protocols, but they don't need specific points for specific
implementations of internal protocols. If they weren't good enough to be
used for exchange, then simply having some unpurposed code points
available for internal use accomplishes the same thing and is available
for other purposes as well. But at the time the annotation characters
were introduced, we were unclear about this.

tex

Murray Sargent wrote:
 
 Michael Everson said Well then they [interlinear annotation characters]
 oughtn't to have been encoded.
 
 Michael, you aren't an implementer. When you implement things
 unambiguously, you may need internal code points in your plain-text
 stream to attach higher-level protocols (such as formatting properties)
 to. Such internal code points should not be exported or imported. From
 your point of view perhaps, they shouldn't have been encoded. But from
 an implementation point of view, they're very handy. Unicode needs to
 serve both purposes. For what use would Unicode be if you couldn't
 implement it effectively?
 
 Murray

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCraft    http://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Furigana

2002-08-13 Thread Tex Texin

Ken, 

http://www.unicode.org/unicode/uni2book/ch13.pdf

As I read that material, I take it to be saying that senders should
remove the I.A. characters.
Does the standard discuss anywhere filtering the characters on the
receiver side?

Clearly Murray has good justification for removing the I.A. characters
as it interferes with his use of the code points internally. (Although I
am sure that could also be designed in ways to preserve them if
absolutely needed.)

But does the standard address their removal by receivers (or
intermediaries) , and does removing them include removing the contained
annotation?

I can imagine an application that doesn't support I.A. deciding the
annotation is out of band and can't be preserved in its plain text
output, and so justifiably strips it as well.
Does the standard say what to do with "for internal use only"
characters?

I would have thought the rule was to ignore and pass along.


Kenneth Whistler wrote:
 
 Michael,
 
  At 14:16 -0700 2002-08-13, Kenneth Whistler wrote:
 Read the discussion which has been published in the Unicode Standard
 ever since these things were available. TUS 3.0, pp. 325 - 326.

 
 The annotation characters are used in internal processing when
^^^
  out-of-band information is associated with a character stream, very
  similarly to the usage of the U+FFFC OBJECT REPLACEMENT CHARACTER...
 
 Usage of the annotation characters in plain text interchange is
  strongly discouraged without prior agreement between the sender
  
  and receiver because the content may be misinterpreted otherwise...
 
 When an output for plain text usage is desired and when the receiver
 ^
  is unknown to the sender, these interlinear annotation characters
  
  should be removed...
  ^
 
 The Japanese national body was very clear about this, and was opposed
 to these going into the standard unless such clarifications were made,
 to ensure that these were not intended for plain text interchange
 of furigana (or other similar annotations).
 
 --Ken

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCraft    http://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Furigana

2002-08-13 Thread Kenneth Whistler

Tex asked:

 But does the standard address their removal by receivers (or
 intermediaries) , and does removing them include removing the contained
 annotation?

Yes and yes. p. 326:

On input, a plain text receiver should either preserve all characters
^^
or remove the interlinear annotation characters as well as the annotating
   ^^
text...


 
 I can imagine an application that doesn't support I.A. deciding the
 annotation is out of band and can't be preserved in its plain text
 output, and so justifiably strips it as well.
  Does the standard say what to do with "for internal use only"
 characters?

Yes. Unicode 3.1:

D7b: Noncharacter: a code point that is permanently reserved for
 internal use, and that should never be interchanged.

C10: A process shall make no change in a valid coded character
 representation other than the possible replacement of
 character sequences by their canonical-equivalent sequences
 or the deletion of noncharacter code points, if that process
 purports not to modify the interpretation of that coded
 character sequence.

The interlinear annotation characters fall in a gray zone, since
they are not noncharacters, but by rights ought to have been.
Since they are standard characters though, the standard has to
provide some guidelines -- and it is simply safer, if you encounter
and delete them, to also delete the annotation. You would be changing
the interpretation of the text, but in a knowing, intended manner.
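A sketch of the "remove the annotation as well" option, in Java (mine, not
text from the standard; it assumes well-formed, unnested annotation
sequences):

  // Sketch: drop U+FFF9/U+FFFA/U+FFFB and the annotating text between
  // U+FFFA and U+FFFB, keeping the annotated (base) text.
  public class StripAnnotations {
      static String strip(String s) {
          StringBuilder out = new StringBuilder();
          boolean inAnnotation = false;
          int i = 0;
          while (i < s.length()) {
              int cp = s.codePointAt(i);
              if (cp == 0xFFFA) inAnnotation = true;
              else if (cp == 0xFFFB) inAnnotation = false;
              else if (cp != 0xFFF9 && !inAnnotation) out.appendCodePoint(cp);
              i += Character.charCount(cp);
          }
          return out.toString();
      }

      public static void main(String[] args) {
          String in = "\uFFF9\u6F22\u5B57\uFFFA\u3075\u308A\u304C\u306A\uFFFB desu.";
          System.out.println(strip(in));  // prints: 漢字 desu.
      }
  }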

 
 I would have thought the rule was to ignore and pass along.

In general, yes, as for everything else, including unassigned
code points. If your role in life is as a database, for example,
or some other kind of data source or data pipe, then minimal
meddling with the bytes is safest. But other kinds of processes
will do graduated manipulations, depending on what they are
aiming for.

--Ken




RE: Furigana

2002-08-13 Thread Murray Sargent

I agree. The current thinking is that U+FFF9 - U+FFFB have no
external meaning and shouldn't appear externally, i.e., they are
noncharacters in every way except in the spec (sigh). They can be used
for whatever an implementer wants internally. I mentioned earlier that
the RichEdit edit engine uses them for table-row delimiters, which have
nothing to do with Furigana. Instead, RichEdit 5.0 uses codes from the
U+FDD0 - U+FDEF block for Furigana and various 2D math objects.
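By way of illustration only (not RichEdit's actual code), an import-side guard
of the kind such internal use relies on, filtering the noncharacter ranges
U+FDD0..U+FDEF plus the last two code points of each plane:

  // Sketch: drop noncharacter code points on input so external data cannot
  // collide with whatever internal meaning an implementation gives them.
  public class ImportFilter {
      static boolean isNoncharacter(int cp) {
          return (cp >= 0xFDD0 && cp <= 0xFDEF)
                  || (cp & 0xFFFF) == 0xFFFE
                  || (cp & 0xFFFF) == 0xFFFF;
      }

      static String dropNoncharacters(String in) {
          StringBuilder out = new StringBuilder();
          in.codePoints()
            .filter(cp -> !isNoncharacter(cp))
            .forEach(out::appendCodePoint);
          return out.toString();
      }

      public static void main(String[] args) {
          System.out.println(dropNoncharacters("abc\uFDD0def"));  // prints: abcdef
      }
  }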

Thanks
Murray

-Original Message-
From: Tex Texin [mailto:[EMAIL PROTECTED]] 
Sent: Tuesday, August 13, 2002 6:11 PM
To: Murray Sargent
Cc: Michael Everson; [EMAIL PROTECTED]
Subject: Re: Furigana


Murray,

It's true implementers need  some place to attach higher level
protocols, but they don't need specific points for specific
implementations of internal protocols. If they weren't good enough to be
used for exchange, then simply having some unpurposed code points
available for internal use accomplishes the same thing and is available
for other purposes as well. But at the time the annotation characters
were introduced, we were unclear about this.

tex





Re: Furigana

2002-08-13 Thread Tex Texin

Thanks Ken. I don't know how I missed the text on 326 when I scanned it
before I mailed.
tex

Kenneth Whistler wrote:
 
 Tex asked:
 
  But does the standard address their removal by receivers (or
  intermediaries) , and does removing them include removing the contained
  annotation?
 
 Yes and yes. p. 326:
 
 On input, a plain text receiver should either preserve all characters
 ^^
 or remove the interlinear annotation characters as well as the annotating
^^
 text...
 
 
 
  I can imagine an application that doesn't support I.A. deciding the
  annotation is out of band and can't be preserved in its plain text
  output, and so justifiably strips it as well.
   Does the standard say what to do with "for internal use only"
  characters?
 
 Yes. Unicode 3.1:
 
 D7b: Noncharacter: a code point that is permanently reserved for
  internal use, and that should never be interchanged.
 
 C10: A process shall make no change in a valid coded character
  representation other than the possible replacement of
  character sequences by their canonical-equivalent sequences
  or the deletion of noncharacter code points, if that process
  purports not to modify the interpretation of that coded
  character sequence.
 
 The interlinear annotation characters fall in a gray zone, since
 they are not noncharacters, but by rights ought to have been.
 Since they are standard characters though, the standard has to
 provide some guidelines -- and it is simply safer, if you encounter
 and delete them, to also delete the annotation. You would be changing
 the interpretation of the text, but in a knowing, intended manner.
 
 
  I would have thought the rule was to ignore and pass along.
 
 In general, yes, as for everything else, including unassigned
 code points. If your role in life is as a database, for example,
 or some other kind of data source or data pipe, then minimal
 meddling with the bytes is safest. But other kinds of processes
 will do graduated manipulations, depending on what they are
 aiming for.
 
 --Ken

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCraft    http://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Furigana

2002-08-13 Thread James Kass


Kenneth Whistler wrote,

 The interlinear annotation characters fall in a gray zone, since
 they are not noncharacters, but by rights ought to have been.
 Since they are standard characters though, the standard has to
 provide some guidelines -- and it is simply safer, if you encounter
 and delete them, to also delete the annotation. You would be changing
 the interpretation of the text, but in a knowing, intended manner.
 

Should a character encoding standard ever encode a non-character?
Is there such a thing as a non-character with a specific semantic 
meaning?  Can't apps needing internal processing code points which 
are only going to be deleted before export simply use the PUA? 
If the PUA isn't acceptable, and the existing code points reserved
for undefined non-characters isn't large enough, wouldn't it be
better to assign a range of undefined non-characters in one of 
the higher planes for these internal processing needs?

No application should delete anything without first asking the user's
permission.

Imagine spending considerable time and effort getting a text to
look just as desired only to have some application arbitrarily decide
to delete half of it without your permission or knowledge.

Best regards,

James Kass.






Re: Furigana

2002-08-13 Thread John Cowan

James Kass scripsit:

 Should a character encoding standard ever encode a non-character?

Non-characters aren't encoded, they're reserved either for specific
purposes or for any desired purpose.

 Is there such a thing as a non-character with a specific semantic 
 meaning?

Why not?

 Can't apps needing internal processing code points which 
 are only going to be deleted before export simply use the PUA? 

No, because they may need the PUA to represent characters interchanged
under a private agreement.

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
In computer science, we stand on each other's feet.
--Brian K. Reid




Re: Furigana

2002-08-13 Thread James Kass


John Cowan wrote,

 Non-characters aren't encoded, they're reserved either for specific
 purposes or for any desired purpose.


If it's a specific purpose, it seems like it should either fall under
character or mark-up.

I can understand reserving code points for any desired purpose,
such as control characters or escape sequences.  These may well
differ from application to application.  Once a meaning like
INTERLINEAR ANNOTATION ANCHOR has been assigned to
a code point, any application which chooses to use that code
point for any other purpose would be at fault.

In other words, if these characters are to be used internally for
Japanese Ruby (furigana), etc., then they ought to be able to
be used externally, as well.

I understand that having common internal use code points might
be considered handy from an implementer's point of view, but
suggest that such conventions should be shared among implementers
only, and should not be enshrined in a character encoding standard.
 
  Is there such a thing as a non-character with a specific semantic 
  meaning?
 
 Why not?
 

Because it seems to be an oxymoron.  If it has a specific semantic
meaning, then it should be possible to store and exchange it
without any loss of meaning.  In other words, it's a character
and should be so encoded.  (Logos and such notwithstanding.)

Best regards,

James Kass.





Re: Furigana

2002-08-12 Thread Kenneth Whistler

Michael asked:

 At 12:11 -0700 2002-08-08, Kenneth Whistler wrote:
 
 Ah, but read the caveats carefully. The Unicode interlinear
 annotation characters are *not* intended for interchange, unlike
 the HTML4 ruby tag. See TUS 3.0, p. 326. They are, essentially,
 internal-use anchor points.
 
 What does this mean? That if I have a text all nice and marked up 
 with furigana in Quark I can't export it to Word and reimport it in 
 InDesign and expect my nice marked up text to still be marked up?

Yes, among other things.

 
 Surely all Unicode/10646 characters are expected to be preserved in 
 interchange. What have I got wrong, Ken?

Your expectation that this stuff will actually work that way.

Yes, the characters will be preserved in interchange. But the
most likely result you will get is:

<anchor1>text<anchor2>annotation<anchor3>

where the anchors will just be blorts. You should not expect that
the whole annotation *framework* will be implemented, and certainly
not that these three characters will suffice for "nice[ly] marked up...
furigana."

These animals are more like U+FFFC -- they are internal anchors
that should not be exported, as there is no general expectation
that once exported to plain text, a receiver will have sufficient
context for making sense of them in the way the originator was
dealing with them internally.

By rights, this whole problem of synchronizing the internal anchor
points for various ruby schemes should have been handled by
noncharacters -- but that mechanism was not really understood
and expanded sufficiently until after the interlinear annotation
characters were standardized.

--Ken






Re: Furigana

2002-08-12 Thread Doug Ewell

Kenneth Whistler kenw at sybase dot com wrote:

 Surely all Unicode/10646 characters are expected to be preserved in
 interchange. What have I got wrong, Ken?

 Your expectation that this stuff will actually work that way.

 Yes, the characters will be preserved in interchange. But the
 most likely result you will get is:

 <anchor1>text<anchor2>annotation<anchor3>

 where the anchors will just be blorts. You should not expect that
 the whole annotation *framework* will be implemented, and certainly
 not that these three characters will suffice for "nice[ly] marked
 up... furigana."

I don't have any problem with the idea that many, or even all, of
today's applications lack meaningful support for ideographical
annotation characters, and will display them as blorts, and I doubt that
Michael expects widespread support for them either.  What worries me is
what Ken says next:

 These animals are more like U+FFFC -- they are internal anchors
 that should not be exported, as there is no general expectation
 that once exported to plain text, a receiver will have sufficient
 context for making sense of them in the way the originator was
 dealing with them internally.

 By rights, this whole problem of synchronizing the internal anchor
 points for various ruby schemes should have been handled by
 noncharacters -- but that mechanism was not really understood
 and expanded sufficiently until after the interlinear annotation
 characters were standardized.

This moves the entire issue out of the realm of poor support and into
the big, dark, scary cavern of pre-deprecation.

Unicode 3.0 doesn't say exactly what Ken says.  Unicode 3.0 (p. 326)
says the annotation characters should only be used under prior
agreement between the sender and the receiver because the content may be
misinterpreted otherwise.  Fine, no problem; those are the same rules
that apply to the PUA.  Ken, though, seems to say they shouldn't be
exported at all, and furthermore they shouldn't even have been encoded
in the first place, except that the noncharacters (which explicitly
mustn't be interchanged) hadn't been invented yet.

This sounds like Plane 14, or the combining Vietnamese tone marks, all
over again -- Unicode (and/or WG2) invents a mechanism, but then wishes
they hadn't, or thinks of a better way, so the mechanism is strongly
discouraged and eventually deprecated.  (Not that I liked the separate
Vietnamese tone marks; don't get me wrong.)

Some groups, like IDN and the security mavens, criticize Unicode for its
perceived instability.  A lot of the attention seems to revolve around
gray areas of normalization and bidi, or confusable glyphs (what I call
"spoof buddies").  Can I suggest that a potentially larger source of
instability comes from the creation of characters and encoding
mechanisms that are subsequently discouraged or deprecated because maybe
they weren't fully thought out in the first place?  The approval process
in Unicode, and especially WG2, is a slow one, and some of these "on
second thought" decisions race ahead of the approval process, so that
the mechanisms are already doomed by the time of publication.

Everybody will welcome the new conventional, graphical-type characters
and scripts that are coming with Unicode 4.0.  But maybe before
standardizing another COMBINING GRAPHEME JOINER or other control-type
character, it would be prudent to study the angles even more thoroughly
and carefully, and make *damn* sure the character is going to be usable
and not discouraged or even deprecated at birth.

(No, I have never been involved in the character standardization
process -- but I *have* been on committees that encoded other types of
things too hastily and then had to find a way to take back their
decision.)

-Doug Ewell
 Fullerton, California





Re: Furigana

2002-08-10 Thread Michael Everson

At 12:11 -0700 2002-08-08, Kenneth Whistler wrote:

Ah, but read the caveats carefully. The Unicode interlinear
annotation characters are *not* intended for interchange, unlike
the HTML4 ruby tag. See TUS 3.0, p. 326. They are, essentially,
internal-use anchor points.

What does this mean? That if I have a text all nice and marked up 
with furigana in Quark I can't export it to Word and reimport it in 
InDesign and expect my nice marked up text to still be marked up?

Surely all Unicode/10646 characters are expected to be preserved in 
interchange. What have I got wrong, Ken?
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Furigana

2002-08-08 Thread Kenneth Whistler

Stefan wrote:

  Many Japanese word processors already have that capability.  HTML4 has 
  ruby tag exactly for that purpose.
 
 And Unicode has characters for that purpose, too.
 
   Unicode: U+FFF9 kanji U+FFFA furigana U+FFFB  
   HTML4:  <RUBY><RB> kanji </RB><RT> furigana </RT></RUBY>
 
 
 Examples:
 ?漢字?ふりがな?
 漢字ふりがな

Ah, but read the caveats carefully. The Unicode interlinear
annotation characters are *not* intended for interchange, unlike
the HTML4 ruby tag. See TUS 3.0, p. 326. They are, essentially,
internal-use anchor points.

--Ken






Re: Furigana can be katakana

2002-01-25 Thread Stefan Persson

- Original Message -
From: ろ ろ〇〇〇 [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: den 25 januari 2002 23:23
Subject: Furigana can be katakana

 In my Love Hina vol 7, 千年 has furigana ミレニアム.

In cases such as ?瑞典?スウェーデン? (is the furigana encoded correctly?) the
furigana should always be written in katakana, right?

Stefan


_
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com





RE: Furigana codes?

2000-07-06 Thread Marco . Cimarosti

Daniel Biddle wrote:
 On Wed, 5 Jul 2000, Rick McGowan wrote:
  iRck
 I thought this was a typo until I saw your address. U+263A

It's not a typo: Rick's signature has passed through an Indic renderer, so
the "i" was reordered. U+FF1AU+FF0DU+FF09

_ Maco`



Re: Furigana codes?

2000-07-05 Thread Rick McGowan

Will someone PLEASE send this boy a book!?

iRck



Begin forwarded message:

 From: [EMAIL PROTECTED] 
 Date: Sat, 01 Jul 2000 02:49:30 -0800 (GMT-0800) 
 To: Unicode List [EMAIL PROTECTED] 
 Subject: Furigana codes? 
 X-UML-Sequence: 14481 (2000-07-01 10:49:31 GMT) 
  
 Are there furigana codes? If not, there darn well need to be. 
 Like: BEGIN WHAT THE FURIGANA IS FOR, then START FURIGANA, then END FURIGANA. 
  



Re: Furigana codes?

2000-07-01 Thread Michael \(michka\) Kaplan

From: [EMAIL PROTECTED]
 Are there furigana codes? If not, there darn well need to be.
 Like: BEGIN WHAT THE FURIGANA IS FOR, then START FURIGANA, then END
FURIGANA.

AFAIK, Furigana is not made up of separate code points; it is text that
can be Hiragana, Katakana, or Romaji.

There are converters built into all versions of Microsoft Access 2000/Excel
2000 (and Asian versions of Access 97 and Excel 97). I have also seen a
couple on the web.

In any case, what are you wanting to see covered, and where?

michka