Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Mark Davis ☕
Popping up a level.

ICU (and some other libraries) have heuristic encoding detection, that will
take a sequence of bytes and come up with a likely encoding id.


Mark

— Il meglio è l’inimico del bene —


On Fri, Jul 19, 2013 at 8:40 PM, Whistler, Ken  wrote:

>
>
> > Suppose that these hex bytes:
> >
> >   C3 83 C2 B1
> >
> > show up in a message and the message contains no hint what its encoding
> is.
> >
> > Perhaps it is 8859-1, in which case the message consists of four 1-byte
> > characters:
> >
> > C3 = Ã
> > 83 = the “no break here” character
> > C2 = Â
> > B1 = ±
> >
> > Perhaps it is UTF-8, in which case the message consists of two 2-byte
> > characters:
> >
> > C383 = 쎃
> > C2B1 = 슱
>
> Actually, that would be interpreting it as UTF-16, not as UTF-8. That
> can probably be quickly ruled out if the rest of the text is not obviously
> in UTF-16.
>
> Interpreted as UTF-8, it would be:
>
> C3 83 --> U+00C3 = Ã
> C2 B1 --> U+00B1 = ±
>
> More likely than the other two alternatives you cite.
>
> Of course, you also have to consider serial corruptions as a possibility.
>
> It could have started out as UTF-8 C3 B1 --> U+00F1 = ñ.
>
> Then the bytes C3 B1 got misinterpreted as Latin-1 (Ã±), and that text was then
> re-encoded as UTF-8, giving C3 83 C2 B1.
>
> --Ken
>
>
>
>


RE: Ways to show Unicode contents on Windows?

2013-07-19 Thread Peter Constable
I'm sorry that Microsoft's approach to product servicing does not meet your 
expectations. It is what it is, however. The costs involved in servicing the XP 
code base (which was forked from all subsequent Windows versions in 2001, so 
effectively does date back to then) are greater than I think you realize. Also, 
while you would evidently appreciate seeing an optional update for Uniscribe 
show up in Windows Update, the vast majority of users would only be confused by 
that. While I wish we could provide what you'd like, you represent a tiny 
fraction of all customers. 


Peter

-Original Message-
From: Eli Zaretskii [mailto:e...@gnu.org] 
Sent: Friday, July 19, 2013 11:29 AM
To: Peter Constable
Cc: nospam-ab...@ilyaz.org; unicode@unicode.org
Subject: Re: Ways to show Unicode contents on Windows?

> From: Peter Constable 
> CC: "nospam-ab...@ilyaz.org" , "unicode@unicode.org"
>   
> Date: Fri, 19 Jul 2013 15:15:51 +
> 
> The largest share of customers, by far, wouldn't want us to add new features 
> to XP since that would entail risks of new bugs, application compatibility 
> regressions, and frequent need to retrain users. If Ford or BMW were 
> continuously retrofitting design changes to vehicles in use, the mechanics, 
> parts dealers etc. would have headaches keeping up.

I'm sorry, but your analogy is broken.  OS updates are installed by myself, so 
no dealers, let alone mechanics, are involved, and no spare parts are anywhere 
in sight.  This is software; any analogy with hardware is almost always 
fundamentally wrong on any number of levels.

And I did mention that these upgrades could have been offered as "optional", so 
only those who really need them would install them (since "optional" updates 
are not automatically installed when the user chooses "express" installation, 
something 99.99% of users and the automatic update installation do).

> You shouldn't expect Unicode 6.2 support in an Android phone from 2008; even 
> less so, from a Windows XP system from 2001.

XP SP3 is not a 2001 system.  It was released in May 2008.  It upgraded many 
parts of Windows, including Internet Explorer.  I don't see why it couldn't 
upgrade Uniscribe.









Re: Ways to show Unicode contents on Windows?

2013-07-19 Thread Richard Wordingham
Peter Constable  wrote:
> Behalf Of Ilya Zakharevich wrote:

> > Why would one NEED to upgrade the OS to use Old Italic?

> You can't expect an OS like Windows XP to support Old Italic
> characters that weren't even defined in Unicode at the time it
> shipped.

That actually came as a great surprise to me.  I once naively thought
that all that had to be done was to update the version of the Unicode
Character Database (UCD) that the system was using, and then only new
*properties* should be causing major trouble.  Now scripts needing
reordering have their own problems, but that sort of problem is what
SIL developed Graphite for.  (I fear the case for Microsoft Office to
support Graphite is steadily reducing.)

The problem with changes to the UCD arises partly because enough
developers prefer speed and compactness to flexibility.

> That said, it turns out that a given version of Windows does support
> later-encoded characters such as Old Italic that have no special
> requirements fairly well -- provided you have a font and format your
> content with that font.

Are you sure this tolerance isn't by design?

> It is the case of simple rendering.  Given a font, and a keyboard
> layout (both doable in user-land), it should “just work”.  Or I am
> missing something?

The biggest thing you're missing is too much cleverness, and the second
is centralisation.

Word switches keyboard at the very least as you step through text,
which in simple cases is quite helpful.  Also, Office has (at least)
three current fonts - one for simple scripts, one for complex scripts,
and one for CJK scripts.  This in itself can cause problems with new
scripts - I have a fair bit of Tai Tham text in Open Document format
that has the wrong size because LibreOffice hesitantly changed the
script's classification from simple to complex.

The centralisation issue is that Indic rearrangement and selection of
Arabic and Syriac contextual forms seemed obvious things to abstract
away from fonts and handle centrally. Consequently, text is split by
script and each script run handled separately.

Combining the two, we can certainly have Word XP asking whether a
font supports a script, and refusing to use it for the script if it
doesn't declare it does. I had to fiddle the OS/2 table of a Tai Tham
hack font (Lannaworld) to be able to use it.  The font maps Latin and
Thai characters to Tai Tham glyphs, but when I downloaded the font it
didn't declare support for the 'Basic Latin' character range or the
'Latin-1' encoding.  To get the font to work, I not only had to dodge
the constraints on Thai character sequences, I also had to change the
OS/2 table to declare that the font supported the Latin range and
encoding.
 
I still don't think we've got to the bottom of Doug's PUA problem.  For
all I know, he may have been violating the agreement he made with
Microsoft for the use of the PUA.  I'm not aware of Microsoft
publishing a consolidated statement of this agreement, but I've a
feeling some characters are reserved for symbol fonts and yet others are
reserved for Thai glyphs.  It's also conceivable that he trespassed on
the PUA assignments decreed by China for Tibetan.

Richard.




Re: Ways to show Unicode contents on Windows?

2013-07-19 Thread Richard Wordingham
On Fri, 19 Jul 2013 23:35:32 +0300
Eli Zaretskii  wrote:
> From: Peter Constable 

> IOW, the assertion that one cannot expect an OS shipped in
> 2001 to support scripts that didn't exist at that time is simply
> false.  There's no technical problem here, only a managerial decision.

If the Wikipedia article on Uniscribe is correct, the latest Uniscribe
would have been in use, at least with Office, on Windows XP by anyone
who had Office 2010.

>> Also, while you would evidently appreciate seeing an optional
>> update for Uniscribe show up in Windows Update, the vast majority
>> of users would only be confused by that.

> How can a newer and better text shaping engine possibly confuse users?

However, this version, Uniscribe 1.626.7600.20602, seems to have been
one that refused to render scripts added later than Unicode 5.1.
Presumably this loss should only have been apparent when using
Office.  This was corrected in Windows 7 SP1, though I don't know what
was done for Windows XP users of Office 2010.

Microsoft did not see fit to announce this fix as one of the benefits
of Windows 7 SP1, so I presume that Microsoft thought that too many
people would simply not understand 'shaping engine' or would not care
about the change.

Richard.



RE: Ways to show Unicode contents on Windows?

2013-07-19 Thread Peter Constable
Every Unicode code point will have some default behaviour in any text process 
on Windows. If those default behaviours happen to fit the character in 
question, then you should get the behaviour you want. But we don't service 
Windows for each UCD update. Also, not every text process relies solely on UCD 
data.


Peter

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Richard Wordingham
Sent: Friday, July 19, 2013 1:21 PM
To: unicode@unicode.org
Subject: Re: Ways to show Unicode contents on Windows?

Peter Constable  wrote:
> Behalf Of Ilya Zakharevich wrote:

> > Why would one NEED to upgrade the OS to use Old Italic?

> You can't expect an OS like Windows XP to support Old Italic 
> characters that weren't even defined in Unicode at the time it 
> shipped.

That actually came as a great surprise to me.  I once naively thought that all 
that had to be done was to update the version of the Unicode Character Database 
(UCD) that the system was using, and then only new
*properties* should be causing major trouble.  Now scripts needing reordering 
have their own problems, but that sort of problem is what SIL developed 
Graphite for.  (I fear the case for Microsoft Office to support Graphite is 
steadily reducing.)

The problem with changes to the UCD arises partly because enough developers 
prefer speed and compactness to flexibility.

> That said, it turns out that a given version of Windows does support 
> later-encoded characters such as Old Italic that have no special 
> requirements fairly well -- provided you have a font and format your 
> content with that font.

Are you sure this tolerance isn't by design?

> It is the case of simple rendering.  Given a font, and a keyboard 
> layout (both doable in user-land), it should “just work”.  Or I am 
> missing something?

The biggest thing you're missing is too much cleverness, and the second is 
centralisation.

Word switches keyboard at the very least as you step through text, which in 
simple cases is quite helpful.  Also, Office has (at least) three current fonts 
- one for simple scripts, one for complex scripts, and one for CJK scripts.  
This in itself can cause problems with new scripts - I have a fair bit of Tai 
Tham text in Open Document format that has the wrong size because LibreOffice 
hesitantly changed the script's classification from simple to complex.

The centralisation issue is that Indic rearrangement and selection of Arabic 
and Syriac contextual forms seemed obvious things to abstract away from fonts 
and handle centrally. Consequently, text is split by script and each script 
run handled separately.

Combining the two, we can certainly have Word XP asking whether a font supports 
a script, and refusing to use it for the script if it doesn't declare it does. 
I had to fiddle the OS/2 table of a Tai Tham hack font (Lannaworld) to be able 
to use it.  The font maps Latin and Thai characters to Tai Tham glyphs, but 
when I downloaded the font it didn't declare support for the 'Basic Latin' 
character range or the 'Latin-1' encoding.  To get the font to work, I not only 
had to dodge the constraints on Thai character sequences, I also had to change 
the OS/2 table to declare that the font supported the Latin range and encoding.
 
I still don't think we've got to the bottom of Doug's PUA problem.  For all I 
know, he may have been violating the agreement he made with Microsoft for the 
use of the PUA.  I'm not aware of Microsoft publishing a consolidated statement 
of this agreement, but I've a feeling some characters are reserved for symbol 
fonts and yet others are reserved for Thai glyphs.  It's also conceivable that 
he trespassed on the PUA assignments decreed by China for Tibetan.

Richard.









Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Peter Edberg

On Jul 19, 2013, at 12:42 PM, Mark Davis ☕  wrote:

> Popping up a level.
> 
> ICU (and some other libraries) have heuristic encoding detection, that will 
> take a sequence of bytes and come up with a likely encoding id.

However, the ICU encoding detection typically requires more than 4 bytes 
(usually at least 10 characters worth of bytes) in order to make a reasonable 
guess.

- Peter E

> 
> 
> Mark
> 
> — Il meglio è l’inimico del bene —
> 
> 
> On Fri, Jul 19, 2013 at 8:40 PM, Whistler, Ken  wrote:
> 
> 
> > Suppose that these hex bytes:
> >
> >   C3 83 C2 B1
> >
> > show up in a message and the message contains no hint what its encoding is.
> >
> > Perhaps it is 8859-1, in which case the message consists of four 1-byte
> > characters:
> >
> > C3 = Ã
> > 83 = the “no break here” character
> > C2 = Â
> > B1 = ±
> >
> > Perhaps it is UTF-8, in which case the message consists of two 2-byte
> > characters:
> >
> > C383 = 쎃
> > C2B1 = 슱
> 
> Actually, that would be interpreting it as UTF-16, not as UTF-8. That
> can probably be quickly ruled out if the rest of the text is not obviously
> in UTF-16.
> 
> Interpreted as UTF-8, it would be:
> 
> C3 83 --> U+00C3 = Ã
> C2 B1 --> U+00B1 = ±
> 
> More likely than the other two alternatives you cite.
> 
> Of course, you also have to consider serial corruptions as a possibility.
> 
> It could have started out as UTF-8 C3 B1 --> U+00F1 = ñ.
> 
> Then the bytes C3 B1 got misinterpreted as Latin-1 (Ã±), and that text was then
> re-encoded as UTF-8, giving C3 83 C2 B1.
> 
> --Ken
> 
> 
> 
> 



RE: Ways to show Unicode contents on Windows?

2013-07-19 Thread Peter Constable
Not everything that is technically possible makes good sense. My comments 
clearly were not framed solely in terms of what is technically possible. 


Peter

-Original Message-
From: Eli Zaretskii [mailto:e...@gnu.org] 
Sent: Friday, July 19, 2013 1:36 PM
To: Peter Constable
Cc: nospam-ab...@ilyaz.org; unicode@unicode.org
Subject: Re: Ways to show Unicode contents on Windows?

> From: Peter Constable 
> CC: "nospam-ab...@ilyaz.org" ,
> "unicode@unicode.org"
>   
> Date: Fri, 19 Jul 2013 19:49:10 +
> 
> I'm sorry that Microsoft's approach to product servicing does not meet your 
> expectations. It is what it is, however.

That's not the issue here.  The issue here is that such updates _could_ be 
provided without requiring users to install a newer version of the OS.  IOW, 
the assertion that one cannot expect an OS shipped in 2001 to support scripts 
that didn't exist at that time is simply false.  
There's no technical problem here, only a managerial decision.

> Also, while you would evidently appreciate seeing an optional update for 
> Uniscribe show up in Windows Update, the vast majority of users would only be 
> confused by that.

How can a newer and better text shaping engine possibly confuse users?









Re: Ways to show Unicode contents on Windows?

2013-07-19 Thread Eli Zaretskii
> From: Peter Constable 
> CC: "nospam-ab...@ilyaz.org" , "unicode@unicode.org"
>   
> Date: Fri, 19 Jul 2013 15:15:51 +
> 
> The largest share of customers, by far, wouldn't want us to add new features 
> to XP since that would entail risks of new bugs, application compatibility 
> regressions, and frequent need to retrain users. If Ford or BMW were 
> continuously retrofitting design changes to vehicles in use, the mechanics, 
> parts dealers etc. would have headaches keeping up.

I'm sorry, but your analogy is broken.  OS updates are installed by
myself, so no dealers, let alone mechanics, are involved, and no spare
parts are anywhere in sight.  This is software; any analogy with
hardware is almost always fundamentally wrong on any number of levels.

And I did mention that these upgrades could have been offered as
"optional", so only those who really need them would install them
(since "optional" updates are not automatically installed when the
user chooses "express" installation, something 99.99% of users and the
automatic update installation do).

> You shouldn't expect Unicode 6.2 support in an Android phone from 2008; even 
> less so, from a Windows XP system from 2001.

XP SP3 is not a 2001 system.  It was released in May 2008.  It
upgraded many parts of Windows, including Internet Explorer.  I don't
see why it couldn't upgrade Uniscribe.



Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Karl Williamson

On 07/19/2013 11:51 AM, Costello, Roger L. wrote:

> Hi Folks,
>
> Suppose that these hex bytes:
>
> C3 83 C2 B1
>
> show up in a message and the message contains no hint what its encoding is.
>
> Perhaps it is 8859-1, in which case the message consists of four 1-byte
> characters:
>
> C3 = Ã
> 83 = the “no break here” character
> C2 = Â
> B1 = ±
>
> Perhaps it is UTF-8, in which case the message consists of two 2-byte
> characters:
>
> C383 = 쎃
> C2B1 = 슱



That's not how UTF-8 works.  Instead in UTF-8 it would be:

 C3 83 = LATIN CAPITAL LETTER A WITH TILDE
 C2 B1 = PLUS-MINUS SIGN

It's unlikely that any other encoding will pass a UTF-8 validity test 
for inputs longer than just a few bytes.  So you can rule-in or rule-out 
UTF-8 fairly easily.  You can also look for BOMs to get UTF-16 and UTF-32.
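
As a rough sketch of that order of checks in Python: look for a BOM first, then try a strict UTF-8 decode. The function name `sniff_unicode` and the returned labels are made up for this illustration, and a `None` result only means "not one of these", not which legacy encoding it might be.

```python
import codecs
from typing import Optional

# UTF-32 BOMs are listed before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE), so the longer one must be tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_unicode(data: bytes) -> Optional[str]:
    """Return a Unicode encoding label if a BOM or a UTF-8 validity test
    identifies one, otherwise None (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        # Strict decoding rejects stray lead/continuation bytes and overlong forms.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"
```

Note that pure ASCII input also passes the UTF-8 test, since ASCII is a subset of UTF-8.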


After that, there are various heuristics that can be applied, and people 
have written things that attempt to guess encodings.  An example from 
Perl is

http://search.cpan.org/~dankogai/Encode-2.51/lib/Encode/Guess.pm
but it requires a list of possible encodings that it experiments with.
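
Why a candidate list is needed can be illustrated with a small Python sketch. This is not how Encode::Guess is implemented; it merely tries a strict decode with each encoding name the caller supplies, and the function name `plausible_encodings` is hypothetical.

```python
from typing import Iterable, List


def plausible_encodings(data: bytes, candidates: Iterable[str]) -> List[str]:
    """Return the candidate encodings under which `data` decodes without error.

    Assumes every name in `candidates` is a codec Python knows; an unknown
    name would raise LookupError, which this sketch does not handle.
    """
    survivors = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        survivors.append(name)
    return survivors


# The four bytes from the question survive as both UTF-8 and Latin-1.
print(plausible_encodings(bytes.fromhex("C383C2B1"), ["ascii", "utf-8", "latin-1"]))
```

Because Latin-1 assigns a character to every byte value, it can never be ruled out by a validity test alone; that is where the heuristics come in.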


> Or, perhaps it is some other encoding.
>
> What does one do in such a situation?
>
> /Roger







Re: Ways to show Unicode contents on Windows?

2013-07-19 Thread Eli Zaretskii
> From: Peter Constable 
> CC: "nospam-ab...@ilyaz.org" ,
> "unicode@unicode.org"
>   
> Date: Fri, 19 Jul 2013 19:49:10 +
> 
> I'm sorry that Microsoft's approach to product servicing does not meet your 
> expectations. It is what it is, however.

That's not the issue here.  The issue here is that such updates
_could_ be provided without requiring users to install a newer version
of the OS.  IOW, the assertion that one cannot expect an OS shipped in
2001 to support scripts that didn't exist at that time is simply
false.  There's no technical problem here, only a managerial decision.

> Also, while you would evidently appreciate seeing an optional update for 
> Uniscribe show up in Windows Update, the vast majority of users would only be 
> confused by that.

How can a newer and better text shaping engine possibly confuse users?



RE: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Whistler, Ken


> Suppose that these hex bytes:
> 
>   C3 83 C2 B1
> 
> show up in a message and the message contains no hint what its encoding is.
> 
> Perhaps it is 8859-1, in which case the message consists of four 1-byte
> characters:
> 
> C3 = Ã
> 83 = the “no break here” character
> C2 = Â
> B1 = ±
> 
> Perhaps it is UTF-8, in which case the message consists of two 2-byte
> characters:
> 
> C383 = 쎃
> C2B1 = 슱

Actually, that would be interpreting it as UTF-16, not as UTF-8. That
can probably be quickly ruled out if the rest of the text is not obviously
in UTF-16.

Interpreted as UTF-8, it would be:

C3 83 --> U+00C3 = Ã
C2 B1 --> U+00B1 = ±

More likely than the other two alternatives you cite.

Of course, you also have to consider serial corruptions as a possibility.

It could have started out as UTF-8 C3 B1 --> U+00F1 = ñ.

Then the bytes C3 B1 got misinterpreted as Latin-1 (Ã±), and that text was then
re-encoded as UTF-8, giving C3 83 C2 B1.
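
For anyone who wants to try the readings above, here is a minimal Python sketch using only the built-in codecs. The variable names are made up for illustration, and "utf-16-be" assumes big-endian byte order, which the original question did not state.

```python
raw = bytes.fromhex("C383C2B1")

# Reading 1: ISO 8859-1 (Latin-1), one character per byte.
as_latin1 = raw.decode("latin-1")      # U+00C3, U+0083, U+00C2, U+00B1

# Reading 2: UTF-16 (big-endian assumed), one character per byte pair.
as_utf16 = raw.decode("utf-16-be")     # U+C383, U+C2B1

# Reading 3: UTF-8, two 2-byte sequences.
as_utf8 = raw.decode("utf-8")          # U+00C3, U+00B1

# Undoing a suspected double encoding: turn the UTF-8 reading back into
# bytes via Latin-1, then decode those bytes as UTF-8 once more.
repaired = as_utf8.encode("latin-1").decode("utf-8")   # U+00F1

for label, text in [("latin-1", as_latin1), ("utf-16-be", as_utf16),
                    ("utf-8", as_utf8), ("repaired", repaired)]:
    print(label, [f"U+{ord(c):04X}" for c in text])
```

The repair step only makes sense when there is some other reason to believe the text was double-encoded; applied blindly it can damage text that really does contain Ã±.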

--Ken





What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Costello, Roger L.
Hi Folks,

Suppose that these hex bytes:

C3 83 C2 B1 

show up in a message and the message contains no hint what its encoding is. 

Perhaps it is 8859-1, in which case the message consists of four 1-byte 
characters: 

C3 = Ã
83 = the “no break here” character
C2 = Â
B1 = ±

Perhaps it is UTF-8, in which case the message consists of two 2-byte 
characters:

C383 = 쎃
C2B1 = 슱

Or, perhaps it is some other encoding.

What does one do in such a situation?

/Roger




RE: Ways to show Unicode contents on Windows?

2013-07-19 Thread Doug Ewell
The word "support" has different meanings in the software industry. In
some contexts, it has a very restrictive meaning: "application X does
not support OS version Y" (or vice versa) doesn't just mean the
respective vendors don't offer technical support. It usually means the
application will not run at all under that OS, perhaps enforced by the
installer.

In the case of scripts or characters, I would expect "OS version X does
not support script Y" to mean there is no *out-of-the-box* support for
that script -- no fonts, keyboards, spell checkers, special rendering
engine support, etc. that ship with the OS or are available as a vendor
update. But I would still expect to be able to install my own fonts or
keyboard drivers, or copy and paste text in that script into an app or
document, and have the system apply some default behavior and not
actively prevent its use.

This is a general comment, and not specific to what level of "support"
any particular version of Windows or other OS provides for any
particular script, except for the Word/PUA situation I mentioned
earlier.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell





RE: Ways to show Unicode contents on Windows?

2013-07-19 Thread Peter Constable
The largest share of customers, by far, wouldn't want us to add new features to 
XP since that would entail risks of new bugs, application compatibility 
regressions, and frequent need to retrain users. If Ford or BMW were 
continuously retrofitting design changes to vehicles in use, the mechanics, 
parts dealers etc. would have headaches keeping up. Moreover, if those 
companies spent their time and energy that way adding little extras, we 
wouldn't see as much progress in improvements to reliability or fuel 
efficiency; I suspect most people are glad they don't operate that way. It's 
not that I wouldn't want you to enjoy great language capabilities on XP or 
whatever other old system you might have; it's just a matter of being 
realistic. You shouldn't expect Unicode 6.2 support in an Android phone from 
2008; even less so, from a Windows XP system from 2001.


Peter


From: Eli Zaretskii
Sent: 7/18/2013 11:47 PM
To: Peter Constable
Cc: nospam-ab...@ilyaz.org; unicode@unicode.org
Subject: Re: Ways to show Unicode contents on Windows?

> From: Peter Constable 
> Date: Fri, 19 Jul 2013 03:19:44 +
>
> You can't expect an OS like Windows XP to support Old Italic characters that 
> weren't even defined in Unicode at the time it shipped.

In all fairness, I think you forget the regular OS updates.  I have XP
SP3 installed on one of my machines, and it gets updates until this
very date (the last one was just a week ago).  One of those updates
could bring an upgrade of Uniscribe and whatever else is needed to
support newly introduced scripts.  No one said that these updates are
only for fixing "security holes" or other urgent problems.  Moreover,
the updates are subdivided into "important" and "optional", so the
script support could be in "optional" area, if that's important.

So I guess it's just a matter of managerial decision somewhere,
whether to bring this new support to existing systems or request users
to upgrade to the next OS version.  I see no technical reasons here.





RE: Ways to show Unicode contents on Windows?

2013-07-19 Thread Peter Constable
"... it is becoming more difficult to develop solutions for lesser used 
languages."

Well, starting in Windows 8 the languages you can configure for input number in 
the thousands, being limited only by text display capabilities. E.g., in 
Windows 8 you could configure Dyirbal as one of your languages. That's a huge 
step forward for lesser used / lesser known languages.

Peter


From: Andrew Cunningham
Sent: 7/19/2013 5:21 AM
To: Richard Wordingham
Cc: unicode@unicode.org
Subject: Re: Ways to show Unicode contents on Windows?


Although writing an IME from scratch is beyond the skill set of a few of us.

Although there are text services framework table based IMEs. Although I did 
hear a rumour that support for those may disappear. Not sure if that is true or 
not.

But since Windows 8, it has become even more difficult to track what is 
happening in terms of input, esp since there are more input frameworks than 
there used to be.

One of the reasons I prefer using non-Microsoft tools for complex input 
requirements.

Microsoft typography team has done some very good work. But Microsoft is so 
large, things are becoming fragmented.

Interesting tools like locale builder were never maintained.

And it is becoming more difficult to develop solutions for lesser used 
languages.

It is the nature of the beast, not just an issue with Microsoft and Windows 8, 
but with internationalisation support in many large projects.

Andrew

On 19/07/2013 5:47 PM, "Richard Wordingham" <richard.wording...@ntlworld.com> wrote:
On Thu, 18 Jul 2013 17:11:45 -0700
Ilya Zakharevich <nospam-ab...@ilyaz.org> wrote:

> Just in case: do you realize that out-of-BMP must be specified via
> LIGATURES section?

Yes, for 'character' read UTF-16 code element.  Even worse, you can't
use dead keys outside the BMP, which prevents one using MSKLC for
typing in natural language in cuneiform orthography.  (Plain text
Egyptian is no more supported than is plain text calculus.)  However,
I recall that one can use a simple IME instead.
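
To illustrate the "UTF-16 code element" point: a character outside the BMP occupies two 16-bit code units (a surrogate pair), so a mechanism that emits one code unit per key cannot produce it directly. The Python sketch below lists the code units for an Old Italic and a cuneiform character; the helper name `utf16_units` is hypothetical, and nothing here describes MSKLC's file format itself.

```python
from typing import List


def utf16_units(ch: str) -> List[int]:
    """Return the UTF-16 code units of `ch` as integers (hypothetical helper)."""
    data = ch.encode("utf-16-be")
    # Each code unit is two bytes; read them back as big-endian integers.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


samples = {
    "LATIN CAPITAL LETTER A": "A",
    "OLD ITALIC LETTER A": "\U00010300",
    "CUNEIFORM SIGN A": "\U00012000",
}

for name, ch in samples.items():
    units = " ".join(f"{u:04X}" for u in utf16_units(ch))
    print(f"U+{ord(ch):04X} {name}: {units}")
```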

Richard.




Re: Ways to show Unicode contents on Windows?

2013-07-19 Thread Andrew Cunningham
Although writing an IME from scratch is beyond the skill set of a few of us.

Although there are text services framework table based IMEs. Although I did
hear a rumour that support for those may disappear. Not sure if that is
true or not.

But since Windows 8, it has become even more difficult to track what is
happening in terms of input, esp since there are more input frameworks than
there used to be.

One of the reasons I prefer using non-Microsoft tools for complex input
requirements.

Microsoft typography team has done some very good work. But Microsoft is so
large, things are becoming fragmented.

Interesting tools like locale builder were never maintained.

And it is becoming more difficult to develop solutions for lesser used
languages.

It is the nature of the beast, not just an issue with Microsoft and Windows
8, but with internationalisation support in many large projects.

Andrew
On 19/07/2013 5:47 PM, "Richard Wordingham" wrote:

> On Thu, 18 Jul 2013 17:11:45 -0700
> Ilya Zakharevich  wrote:
>
> > Just in case: do you realize that out-of-BMP must be specified via
> > LIGATURES section?
>
> Yes, for 'character' read UTF-16 code element.  Even worse, you can't
> use dead keys outside the BMP, which prevents one using MSKLC for
> typing in natural language in cuneiform orthography.  (Plain text
> Egyptian is no more supported than is plain text calculus.)  However,
> I recall that one can use a simple IME instead.
>
> Richard.
>
>
>


Re: symbols/codepoints for necessity and possibility in modal logic

2013-07-19 Thread Stephan Stiller

Hi Jörg,

Thanks for the info!


> U+25C7 WHITE DIAMOND
> is the best choice

I'm with you in that for now I'll go with
⟨◻ (U+25FB), ◇ (U+25C7)⟩
as the pair of choice, pending further decisions (see also what I'm
writing further down) or objections from experts stating that the symbol
properties are not suited for a unary mathematical prefix operator,
whatever an "operator" is ("unary operator" is (at least) programming
terminology, found when operator precedence is discussed, so it seems
appropriate).

> followed by
> U+27E1 WHITE CONCAVE-SIDED DIAMOND • never (modal operator)

I'm assuming the charts are reliable in that this in some community (eg:
temporal modal logic) denotes a "never" modality. If so, it definitely
can't be used in place of a (straight-shaped, ordinary, 90°/square)
diamond. The abstract shapes are just clearly different in this case. So
– I wouldn't pick this one. Again: to me, ⬦ (U+2B26) is the next
contender, but that's admittedly impressionistic.

Just because you're mentioning this:

> I couldn't locate WHITE DIAMOND WITH LEFTWARDS TICK in Unicode.

"WHITE DIAMOND WITH RIGHTWARDS TICK" isn't in Unicode either, but then I
don't know whether you'd need such a character.

> (U+2662 WHITE DIAMOND SUIT would also look OK

I thought so too at first, but you are right about this:

> but I think this is symbol abuse

and I also think the lozenge-shape works against this character.

> For the properties of mathematical symbols, see also
> http://www.unicode.org/reports/tr25/
> ---but I have to admit that the report does not answer the specific
> question posed here.

Table 2.5 there (pp. 19-21) is actually quite helpful. From it I would
judge that ◇/U+25C7 is the only sensible choice for the diamond.
Depending on whether you want to match area or height as closely as
possible, it also seems like you really want to pair it up with either
◻/U+25FB or □/U+25A1; it's not clear this is a precise match, but this
is probably the best reasoning. I'd go with the former [my email program
displays the latter as a smaller symbol, but the former is supposed to
be smaller than the latter].

But this one

> Maybe this mapping table is more useful (but harder to read):
> http://www.w3.org/Math/characters/unicode.xml

I don't find helpful in resolving this issue.

Now that I'm reading this

> I'd consider U+22C4 DIAMOND OPERATOR as wrong because it is used as a
> binary operator

I'm thinking that "binary infix symbol" and "unary prefix symbol" might
be appropriate terminology too.

> which has a very different spacing than the unary modal operator
> needed here

The idea that a math symbol has inherent spacing is popular in the LaTeX 
community, though I don't know where else this view exists. There it's 
the math character class that a symbol is assigned to (⋄/U+22C4 is a 
"binary relation" symbol). Of course the math character class can be 
modified ad-hoc very easily, eg you can change it from "binary relation" 
to "relation symbol" by writing "\mathrel{/whatever/}". (Now if this 
doesn't make for messed-up terminology. And the TeXbook seems to use 
"binary operation", plus "operator" by itself has its own association in 
the TeX world, namely with summation and product symbols and other such 
quantifying prefix accumulators.) It's my opinion that it might not be 
good to think of such TeX math character classes as fixed anyways.
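
A small LaTeX sketch of that ad-hoc override, assuming the standard `\diamond` symbol (which LaTeX classes as a binary operation by default) stands in for the modal diamond; which glyph to use is a separate question and is not settled by this example.

```latex
\documentclass{article}
\begin{document}

% Default class: between two ordinary symbols, \diamond is spaced as an infix
% binary operation.
$p \diamond q$

% After the ordinary symbol \neg, a bare \diamond would again get
% binary-operation spacing, which is wrong for a unary prefix symbol.
$\neg \diamond p$

% \mathord overrides the class, so the diamond is set as an ordinary symbol.
$\neg \mathord{\diamond} p$

% \mathrel forces relation spacing instead.
$p \mathrel{\diamond} q$

\end{document}
```

The second and third formulas differ only in the spacing around the diamond, which is the point about the math class not being fixed.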


Stephan



Re: symbols/codepoints for necessity and possibility in modal logic

2013-07-19 Thread Jörg Knappen

I think,

 

U+25C7 WHITE DIAMOND

is the best choice, followed by

U+27E1 WHITE CONCAVE-SIDED DIAMOND • never (modal operator)

 

The latter has a more fancy shape and might not be the one the reader expects. As a plus, it comes also with versions having right and left ticks, needed in some extensions of modal logic. I couldn't locate WHITE DIAMOND WITH LEFTWARDS TICK in Unicode.

 

(U+2662 WHITE DIAMOND SUIT would also look OK, but I think this is symbol abuse. Can be used as a fallback when the font of choice has this one, but none of the two above.)

 

For the properties of mathematical symbols, see also

http://www.unicode.org/reports/tr25/


---but I have to admit that the report does not answer the specific question posed here.

 

Maybe this mapping table is more useful (but harder to read):

http://www.w3.org/Math/characters/unicode.xml

 

--Jörg Knappen

 

P.S. I'd consider U+22C4 DIAMOND OPERATOR as wrong because it is used as a binary operator which has a very different spacing than the unary modal operator needed here.
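
The remark about tick variants can be checked against a character-name database. The sketch below uses Python's standard `unicodedata` module; the helper name `find` and the list of names tried are illustrative assumptions, and the result depends on the Unicode version bundled with the Python build in use.

```python
import unicodedata


def find(name):
    """Return the code point for an exact Unicode character name, or None."""
    try:
        return ord(unicodedata.lookup(name))
    except KeyError:
        return None


# Names tried here are illustrative; only exact matches are found.
NAMES = [
    "WHITE DIAMOND",
    "WHITE DIAMOND WITH LEFTWARDS TICK",
    "WHITE CONCAVE-SIDED DIAMOND",
    "WHITE CONCAVE-SIDED DIAMOND WITH LEFTWARDS TICK",
    "WHITE CONCAVE-SIDED DIAMOND WITH RIGHTWARDS TICK",
]

for name in NAMES:
    cp = find(name)
    shown = "not found" if cp is None else f"U+{cp:04X}"
    print(f"{name}: {shown}")
```

A name that does not resolve only means that no character has exactly that name in the database consulted.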

 


Sent: Friday, 19 July 2013 at 09:43
From: "Stephan Stiller"
To: "Unicode Public"
Subject: Re: symbols/codepoints for necessity and possibility in modal logic


 

What is wrong with using DIAMOND OPERATOR?

"wrong" is strong wording and goes beyond what I suggested or implied, but it's not clear to a user of Unicode that it's the right fit either. There are a couple of indicators factoring in:


	The charts mention modal logic in conjunction with ◻ (U+25FB) and ⟠ (U+27E0) but not with ⋄ (U+22C4).
	The glyph in the code charts is tiny (and that of Cambria Math is tiny as well). Typographically you see various things (a lozenge, fallback to letter-M) in esp older books, but it feels like it's meant to be an orthogonal diamond of perhaps slightly less area than the box but descending a little above and below the box, which is somewhat taller than x-height. The book by {Blackburn, de Rijke, Venema} has glyphs that look right. This is more than a guess: it makes sense if they have similar visual weight, as they are – literally – defined to be duals of one another; but whether you can make them geometrically congruent symbols of equal area I haven't tested (this might have the diamond ascend too far).
	The vague notion of "operator" (a word with different meanings in math, from  logical relation  to  [non-logical/non-relational] mapping of type A×A→A or perhaps A×A→B  to  (linear) map (between say vector spaces) in linear algebra) in this context (in the code charts) seems to refer to something like my middle meaning, which is likely to use a smaller symbol around x-height in placement and dimensions.
	The glyph of ⬦ (U+2B26) seems to have a more appropriate name, but in the charts I like ◇ U+25C7. The differently sized square-like symbols are hard to semantically tell apart in/from the charts anyway.
	These symbols are the first two visually distinct ones you define in modal logic, so they're well-known and standardized in meaning for anyone who had had contact with the field. It's surprising they're not explicitly named in the charts. (There's stuff like the outdated horseshoe for logical implication popping up in the relevant books, but that is a leftover or outdated logic notation in general.) So for box and diamond it's quite reasonable to be expecting a standard math font to provide them just right out of the box; for whatever commonly used box-like symbols in math there are, one would assume that there are corresponding codepoints; otherwise you'd have to choose a different font.



Stephan
 








Re: symbols/codepoints for necessity and possibility in modal logic

2013-07-19 Thread Stephan Stiller



> Why not contact the relevant publishers and find out what they are using?
"Why not contact the relevant governments and find out what they're 
using in order to solve /_*all*_/ encoding issues for /_*all*_/ 
languages and writing systems within a day?" :-)


Publishers use metal type (or various methods of reproducing what
was originally that) or LaTeX, neither of which has an inherent notion
of a codepoint. There are efforts to map between TeX output and Unicode,
but they require that someone either figure out what's best (and so far
it seems like the answer is unclear) or create truth by declaration.


That said, let me contact the most important person there and think of this
"you could ask for some annotations" as a dependent step further down the
action graph.

Stephan



Re: symbols/codepoints for necessity and possibility in modal logic

2013-07-19 Thread Asmus Freytag
Unicode cannot be the arbiter of mathematical (or other) notation, but, 
within limits, you could ask for some annotations if this would help 
ensure that there's some uniformity in how people pick symbols for 
certain purposes.


Why not contact the relevant publishers and find out what they are using?

A./

On 7/19/2013 12:43 AM, Stephan Stiller wrote:



What is wrong with using DIAMOND OPERATOR?
"wrong" is strong wording and goes beyond what I suggested or implied, 
but it's not clear to a user of Unicode that it's the right fit 
either. There are a couple of indicators factoring in:


  * The charts mention modal logic in conjunction with ◻ (U+25FB) and
⟠ (U+27E0) but not with ⋄ (U+22C4).
  * The glyph in the code charts is tiny (and that of Cambria Math is
tiny as well). Typographically you see various things (a lozenge,
fallback to letter-M) in esp older books, but it feels like it's
meant to be an orthogonal diamond of perhaps slightly less area
than the box but descending a little above and below the box,
which is somewhat taller than x-height. The book by {Blackburn, de
Rijke, Venema} has glyphs that look right. This is more than a
guess: it makes sense if they have similar visual weight, as they
are – literally – defined to be duals of one another; but whether
you can make them geometrically congruent symbols of equal area I
haven't tested (this might have the diamond ascend too far).
  * The vague notion of "operator" (a word with different meanings in
math, from /logical relation/  to /[non-logical/non-relational]
mapping of type A×A→A or perhaps A×A→B/  to /(linear) map (between
say vector spaces) in linear algebra/) in this context (in the
code charts) seems to refer to something like my middle meaning,
which is likely to use a smaller symbol around x-height in
placement and dimensions.
  * The glyph of ⬦ (U+2B26) seems to have a more appropriate name, but
in the charts I like ◇ U+25C7. The differently sized square-like
symbols are hard to semantically tell apart in/from the charts anyway.
  * These symbols are the first two visually distinct ones you define
in modal logic, so they're well-known and standardized in meaning
for anyone who had had contact with the field. It's surprising
they're not explicitly named in the charts. (There's stuff like
the outdated horseshoe for logical implication popping up in the
relevant books, but that is a leftover or outdated logic notation
in general.) So for box and diamond it's quite reasonable to be
expecting a standard math font to provide them just right out of
the box; for whatever commonly used box-like symbols in math there
are, one would assume that there are corresponding codepoints;
otherwise you'd have to choose a different font.


Stephan





Re: symbols/codepoints for necessity and possibility in modal logic

2013-07-19 Thread Stephan Stiller



> What is wrong with using DIAMOND OPERATOR?
"wrong" is strong wording and goes beyond what I suggested or implied, 
but it's not clear to a user of Unicode that it's the right fit either. 
There are a couple of indicators factoring in:


 * The charts mention modal logic in conjunction with ◻ (U+25FB) and ⟠
   (U+27E0) but not with ⋄ (U+22C4).
 * The glyph in the code charts is tiny (and that of Cambria Math is
   tiny as well). Typographically you see various things (a lozenge,
   fallback to letter-M) in esp older books, but it feels like it's
   meant to be an orthogonal diamond of perhaps slightly less area than
   the box but descending a little above and below the box, which is
   somewhat taller than x-height. The book by {Blackburn, de Rijke,
   Venema} has glyphs that look right. This is more than a guess: it
   makes sense if they have similar visual weight, as they are –
   literally – defined to be duals of one another; but whether you can
   make them geometrically congruent symbols of equal area I haven't
   tested (this might have the diamond ascend too far).
 * The vague notion of "operator" (a word with different meanings in
   math, from /logical relation/  to /[non-logical/non-relational]
   mapping of type A×A→A or perhaps A×A→B/  to /(linear) map (between
   say vector spaces) in linear algebra/) in this context (in the code
   charts) seems to refer to something like my middle meaning, which is
   likely to use a smaller symbol around x-height in placement and
   dimensions.
 * The glyph of ⬦ (U+2B26) seems to have a more appropriate name, but
   in the charts I like ◇ U+25C7. The differently sized square-like
   symbols are hard to semantically tell apart in/from the charts anyway.
 * These symbols are the first two visually distinct ones you define in
   modal logic, so they're well-known and standardized in meaning for
   anyone who had had contact with the field. It's surprising they're
   not explicitly named in the charts. (There's stuff like the outdated
   horseshoe for logical implication popping up in the relevant books,
   but that is a leftover or outdated logic notation in general.) So
   for box and diamond it's quite reasonable to be expecting a standard
   math font to provide them just right out of the box; for whatever
   commonly used box-like symbols in math there are, one would assume
   that there are corresponding codepoints; otherwise you'd have to
   choose a different font.
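
To compare the code points named in the list above side by side, a short sketch with Python's standard `unicodedata` module can print each one's formal name and general category. The selection in `CANDIDATES` and the helper name `describe` are assumptions made for illustration; the script says nothing about glyph size or spacing, which depend on the font.

```python
import unicodedata

# Code points mentioned in this thread as box/diamond candidates.
CANDIDATES = [0x25FB, 0x22C4, 0x25C7, 0x27E0, 0x27E1, 0x2662, 0x2B26]


def describe(cp):
    """Return (U+XXXX label, character name, general category)."""
    ch = chr(cp)
    return (
        f"U+{cp:04X}",
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
    )


if __name__ == "__main__":
    for cp in CANDIDATES:
        label, name, category = describe(cp)
        print(f"{label}  {chr(cp)}  {category}  {name}")
```

The general category (for example Sm for math symbols, So for other symbols) is one of the few machine-readable hints about intended use.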


Stephan



Re: Ways to show Unicode contents on Windows?

2013-07-19 Thread Richard Wordingham
On Thu, 18 Jul 2013 17:11:45 -0700
Ilya Zakharevich  wrote:

> Just in case: do you realize that out-of-BMP must be specified via
> LIGATURES section?

Yes, for 'character' read UTF-16 code element.  Even worse, you can't
use dead keys outside the BMP, which prevents one using MSKLC for
typing in natural language in cuneiform orthography.  (Plain text
Egyptian is no more supported than is plain text calculus.)  However,
I recall that one can use a simple IME instead.

Richard.
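
A small Python sketch of the point about UTF-16 code elements: a character outside the BMP occupies two 16-bit units (a surrogate pair), where a BMP character occupies one. The helper name `utf16_code_units` and the sample characters are illustrative assumptions; nothing here models the MSKLC file format itself.

```python
def utf16_code_units(ch):
    """Return the UTF-16 code units (as ints) that encode one character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_char = "\u00F1"            # LATIN SMALL LETTER N WITH TILDE, inside the BMP
cuneiform_char = "\U00012000"  # CUNEIFORM SIGN A, outside the BMP

for ch in (bmp_char, cuneiform_char):
    units = utf16_code_units(ch)
    print(
        f"U+{ord(ch):04X}: {len(units)} code unit(s):",
        " ".join(f"{u:04X}" for u in units),
    )
```

Anything that works on one 16-bit unit at a time sees the cuneiform sign as two elements.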




Re: Ways to show Unicode contents on Windows?

2013-07-19 Thread Eli Zaretskii
> From: Peter Constable 
> Date: Fri, 19 Jul 2013 03:19:44 +
> 
> You can't expect an OS like Windows XP to support Old Italic characters that 
> weren't even defined in Unicode at the time it shipped.

In all fairness, I think you forget the regular OS updates.  I have XP
SP3 installed on one of my machines, and it gets updates to this
very day (the last one was just a week ago).  One of those updates
could bring an upgrade of Uniscribe and whatever else is needed to
support newly introduced scripts.  No one said that these updates are
only for fixing "security holes" or other urgent problems.  Moreover,
the updates are subdivided into "important" and "optional", so the
script support could go in the "optional" area, if that's important.

So I guess it's just a matter of a managerial decision somewhere,
whether to bring this new support to existing systems or ask users
to upgrade to the next OS version.  I see no technical reasons here.



Re: symbols/codepoints for necessity and possibility in modal logic

2013-07-19 Thread Asmus Freytag

What is wrong with using DIAMOND OPERATOR?

A./

On 7/18/2013 8:27 PM, Stephan Stiller wrote:

Hi all,

Modal logic uses a "box" and a "diamond" (this is what they're
informally called) as operators (accepting one formula and returning
another) to denote necessity and possibility, resp. Older texts might
use the letters L and M (resp.). Which Unicode codepoints do modal box
and diamond correspond to?


According to the charts, it seems like the box is
◻ (U+25FB)
(is this definitive?), but what about the diamond? Unlike what one 
might glean from the charts, ⟠ (U+27E0) is afaiu /not/ normally used 
to denote possibility in the default† sense. Wiki's "List of logic 
symbols" article has something to say about this too, but I'm always 
cautious about information from there.


Stephan

† eg in the sense of "λ𝑥 . ¬◻¬𝑥" with ◻ as used in say the axiom
schema conventionally named *T* in modal logic
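
The footnote's reading of diamond as "not box not", and the axiom schema T, can be illustrated on a toy Kripke model. The sketch below is an assumption-laden illustration: the worlds, the accessibility relation, and the function names are invented here, and a proposition is represented as the set of worlds where it holds.

```python
# Toy Kripke frame (invented for illustration); the relation is reflexive.
WORLDS = {"w1", "w2", "w3"}
ACCESS = {"w1": {"w1", "w2"}, "w2": {"w2"}, "w3": {"w3", "w1"}}


def neg(prop):
    """Worlds where prop does not hold."""
    return WORLDS - prop


def box(prop):
    """Worlds from which prop holds in every accessible world."""
    return {w for w in WORLDS if ACCESS[w] <= prop}


def diamond(prop):
    """Worlds from which prop holds in at least one accessible world."""
    return {w for w in WORLDS if ACCESS[w] & prop}


p = {"w2"}

# Duality: diamond p coincides with not-box-not p.
print(diamond(p) == neg(box(neg(p))))

# Schema T (box p implies p) holds here because the frame is reflexive.
print(box(p) <= p)
```

The same check can be repeated for any other subset of `WORLDS`. The duality holds on every frame, while the T check relies on reflexivity.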