Re: Scope of Unicode Character Properties (was: Re: Deleting Lone Surrogates)

2015-10-05 Thread Philippe Verdy
2015-10-05 19:11 GMT+02:00 Ken Whistler :

> However, it would be reasonable (and permitted) for an API to actually
> report a default value for a surrogate code point (i.e., treating it more
> or less like the reserved code point U+50005 that Marcus mentioned).
>

Unassigned (reserved) code points, when followed by an assigned combining
mark would still be treated as starters of a combining sequence by default.

This is not (IMHO) desirable for lone surrogates, which would be better
handled in isolation, independently of what follows them.

My opinion is that they should be treated like new line controls, so that
the combining mark after it will also be separated into a defective
combining sequence without any starter (e.g. 000A 0302 creates two
clusters, this should be the same for D800 0302. D800 will have no defined
glyph to render, but the glyph for U+FFFD may be displayed, or just a
".notdef" tofu box).

Now for break opportunities: those lone surrogates should not create a
newline or paragraph break opportunity, but they may create a word break
opportunity to allow their easy separation and selection by a double-click
on this tofu in an editor (they may even create a syllable break
opportunity before and after them to allow wrapping long lines there).
Those adaptations, however, are not described at all in the annexes on
text segmentation.

So those surrogates (which are permanently assigned) could have their own
code point properties more formally defined. In my opinion handling them
like U+ is much better than handling them like U+50005, which should
stay reserved and handled as standard starters with default combining class
0.

Also those lone surrogates should be Bidi-neutral (imagine they occur in
the middle of some Arabic text, they should probably not change the
direction of the surrounding text and should not alter the embedding
context).
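
As an illustration of the clustering behaviour proposed above, here is a minimal Python sketch. It is not the UAX #29 algorithm: it only glues combining marks (general category M*) onto a preceding base, refuses to do so after a control (Cc), and optionally refuses after a surrogate (Cs). The function name `clusters` and the `isolate_surrogates` flag are hypothetical, introduced only for this example.

```python
import unicodedata


def clusters(text, isolate_surrogates):
    """Very simplified clustering: base + following combining marks.

    Assumption: this is a toy model, not the UAX #29 rules. A mark never
    attaches to a control; with isolate_surrogates=True it never attaches
    to a lone surrogate either (the behaviour proposed in the message).
    """
    result = []
    for ch in text:
        if unicodedata.category(ch).startswith("M") and result:
            base_cat = unicodedata.category(result[-1][0])
            blocked = base_cat == "Cc" or (isolate_surrogates and base_cat == "Cs")
            if not blocked:
                result[-1] += ch  # extend the current cluster
                continue
        result.append(ch)  # start a new (possibly defective) cluster
    return result


# 000A 0302: the control already forces a break in this model.
print(clusters("\n\u0302", isolate_surrogates=False))
# D800 0302: one cluster by default, two under the proposal.
print(clusters("\ud800\u0302", isolate_surrogates=False))
print(clusters("\ud800\u0302", isolate_surrogates=True))
```

With the flag set, D800 0302 splits the same way 000A 0302 does, while an unassigned code point such as U+50005 still acts as an ordinary starter.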


Re: Deleting Lone Surrogates

2015-10-05 Thread Richard Wordingham
On Mon, 5 Oct 2015 16:51:25 +0200
Philippe Verdy  wrote:

> 2015-10-05 13:50 GMT+02:00 Martin J. Dürst :
> 
> > In an editing tool (of which an editing interface is a part of), a
> > lone surrogate should just be removed! Apparently, that's what
> > happens in Richard's case, but only eventually.

> Not silently ! Even if this removal is required to go on editing,
> this must be notified to the user as it may occur in unedited parts
> of the file (and it may be the sign that the document is not fully
> plain text, so the user should not save the edited file)
> If this is caused by a quirk in the user input (defect of the input
> mode or keyboard layout), there should be a notification.

The lone surrogates (as I surmise) in this case are caused by the user
input being misinterpreted.  The sequence of strings delivered to a
program running X receiving the same sequence of keystrokes is U+1148F,
U+114C0, U+0008, U+114BF, and I have no reason to doubt that the
offending program is receiving the same sequence.  My working
hypothesis is that this is being simplified to U+1148F, U+D805,
U+114BF; the presence of U+D805 is a program error.  I can reproduce
the problem in a previously empty file.

Now, on Windows, old MS keyboards at least deliver supplementary
characters in a pair of WM_CHAR messages.  If one of these ligatures
were corrupted so that only the first of the messages was delivered, it
is not obvious to me how a program would readily detect the omission.
It would only become obvious when the start of the next *character* was
received.

Richard.
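
The working hypothesis above can be replayed in a short Python sketch. It assumes, purely for illustration, that the application keeps its text as UTF-16 code units and that the faulty backspace removes a single code unit instead of a whole surrogate pair; the helper names (`to_utf16_units`, `from_utf16_units`, `apply_keystrokes`) are hypothetical and do not describe the actual program.

```python
def to_utf16_units(code_points):
    """Encode code points as a list of 16-bit UTF-16 code units."""
    units = []
    for cp in code_points:
        if cp >= 0x10000:
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high (lead) surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low (trail) surrogate
        else:
            units.append(cp)
    return units


def from_utf16_units(units):
    """Decode code units; unpaired surrogates pass through unchanged."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


def apply_keystrokes(keys, naive_backspace):
    """Replay delivered strings; U+0008 deletes backwards from the buffer."""
    buf = []
    for cp in keys:
        if cp != 0x0008:
            buf.extend(to_utf16_units([cp]))
        elif buf:
            n = 1
            ends_with_pair = (len(buf) >= 2 and 0xDC00 <= buf[-1] <= 0xDFFF
                              and 0xD800 <= buf[-2] <= 0xDBFF)
            if ends_with_pair and not naive_backspace:
                n = 2  # a correct backspace removes both halves of the pair
            del buf[-n:]
    return buf


keys = [0x1148F, 0x114C0, 0x0008, 0x114BF]
for naive in (False, True):
    cps = from_utf16_units(apply_keystrokes(keys, naive))
    print(naive, [f"U+{cp:04X}" for cp in cps])
```

Under these assumptions the naive backspace leaves the high surrogate D805 of U+114C0 behind, which matches the sequence U+1148F, U+D805, U+114BF described in the message.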



Scope of Unicode Character Properties (was: Re: Deleting Lone Surrogates)

2015-10-05 Thread Ken Whistler

Section 3.5, Properties, of the standard attempts to address this.

"Code point properties" are properties of the code points, per se, and
clearly do have all code points (U+0000..U+10FFFF) in their scope.
An example is the Surrogate code point property, which wouldn't
make much sense if it didn't apply to surrogate code points!

"Encoded character properties" are properties of the characters
themselves -- attributes like Ideographic or Numeric_Value. For
completeness, those are given *default* values for all reserved code points
(and for noncharacter and PUA code points). In principle, the scope should be
all Unicode scalar values: U+0000..U+D7FF, U+E000..U+10FFFF,
because it doesn't make much sense to talk about character properties
for code points that are ill-formed and which cannot ever actually
represent a character.

However, in practice, it is simplest to extend the *default* values of
encoded character properties to the surrogate code points, so that
in the cases where they occur in ill-formed text, APIs and
applications have some hope of doing something useful,
rather than just reacting exceptionally to featureless singularities
embedded in text.

Hence, the bullet in the text in the standard:

* For each encoded character property there is a mapping from every
code point to some value in the set of values associated with that property.

There is nothing in the standard, as I read it, that imposes a conformance
requirement on any process that would *require* it to interpret
an isolated surrogate code point and give it a particular property value.
However, it would be reasonable (and permitted) for an API to actually
report a default value for a surrogate code point (i.e., treating it more
or less like the reserved code point U+50005 that Marcus mentioned).
Such behavior in a character property API is likely to result in more
graceful behavior than simply throwing exceptions.

--Ken

On 10/4/2015 12:30 PM, Richard Wordingham wrote:

> Do all Unicode character properties extend to all codepoints?  If not,
> how does one tell which do and which don't? ...
> Richard.
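
As one concrete example of a property API that reports values for surrogate code points instead of throwing, Python's standard `unicodedata` module can be queried as sketched below. The helper `describe` is a hypothetical name used only here, and the values returned depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(cp):
    """Collect a few properties for any code point, surrogates included."""
    ch = chr(cp)  # Python strings may hold a lone surrogate code point
    return {
        "code_point": f"U+{cp:04X}",
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        # name() raises ValueError without a default, so supply one
        "name": unicodedata.name(ch, "<no name>"),
    }


for cp in (0xD800, 0x50005, 0x0302):
    print(describe(cp))
```

Here the surrogate is reported with general category Cs and the reserved code point U+50005 with Cn, both with combining class 0, so callers get usable values rather than an exception.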




Re: Deleting Lone Surrogates

2015-10-05 Thread Asmus Freytag (t)

On 10/5/2015 7:51 AM, Philippe Verdy wrote:

> Not silently ! Even if this removal is required to go on editing, this
> must be notified to the user as it may occur in unedited parts of the
> file (and it may be the sign that the document is not fully plain
> text, so the user should not save the edited file)
> If this is caused by a quirk in the user input (defect of the input
> mode or keyboard layout), there should be a notification.


As long as we are discussing, as Richard is, recommendations for
implementers, I fully agree with Philippe.


Manually editing surrogate corruptions might be something that could be 
relegated to an "expert mode", but automatic correction without user 
confirmation ("May we clean up your file?") would indeed be spooky and 
dangerous.


A./


Re: Deleting Lone Surrogates

2015-10-05 Thread Philippe Verdy
Not silently ! Even if this removal is required to go on editing, this must
be notified to the user as it may occur in unedited parts of the file (and
it may be the sign that the document is not fully plain text, so the user
should not save the edited file)
If this is caused by a quirk in the user input (defect of the input mode or
keyboard layout), there should be a notification.

But for a general purpose editor that allows editing files including binary
ones (e.g. Emacs), it is best NOT to drop those lone surrogates at all, and
effectively treat them in isolation for ALL purposes: the DELETE key should
not delete more than this lone surrogate. (It may be necessary to adjust the
cursor position after the deletion if the editor does not support placing
the cursor in the middle of a combining sequence, but a LONE surrogate + a
combining character should still be treated as two separate clusters and
the cursor or selection should be placeable between the lone surrogate and
the combining mark.)

Note that file formats that contain binary parts and plain text parts do
exist, e.g. media files that contain a final plain text section for
metadata or for some XML data signature : it is safe to edit that final
part in a text editor, provided that it does not silently change the
encoding of the binary part.

In summary, I do not like the idea of silently dropping lone surrogates in
editors. If the editor needs it because it cannot safely handle binary
parts, the notification will say to the user that he should not use that
editor and choose something else, or it will allow the user to select
another appropriate file encoding to edit the file safely. The user should
not save the file blindly as it will be corrupted silently. Doing otherwise
would be a security issue.

And this remark extends to all other protocols using plain text input:
lone surrogates should not be dropped silently (unless explicitly requested,
for example in a maintenance cleanup or repair). If the lone surrogate
breaks further processing, the only safe option is to reject the
whole text and report the error if text data is required but missing.
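
The "report, do not silently drop" policy described above could be sketched in Python as follows. The names `find_lone_surrogates`, `LoneSurrogateError` and `check_before_save` are hypothetical, and the sketch assumes the text is held as a Python `str`, where any code point in D800..DFFF is necessarily a lone surrogate.

```python
class LoneSurrogateError(ValueError):
    """Raised instead of silently dropping ill-formed text (hypothetical)."""


def find_lone_surrogates(text):
    """Return (index, 'U+XXXX') for every surrogate code point in text."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text)
            if 0xD800 <= ord(ch) <= 0xDFFF]


def check_before_save(text):
    """Encode to UTF-8 only if the text is well-formed; otherwise report."""
    hits = find_lone_surrogates(text)
    if hits:
        # Reject the whole text and tell the caller where the problem is.
        raise LoneSurrogateError(f"lone surrogates at {hits}")
    return text.encode("utf-8")


sample = "\U0001148F\ud805\U000114BA"
print(find_lone_surrogates(sample))
try:
    check_before_save(sample)
except LoneSurrogateError as err:
    print("not saved:", err)
```

An editor built along these lines would surface the error to the user, who can then decide whether to repair the text or open the file in a different encoding.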

2015-10-05 13:50 GMT+02:00 Martin J. Dürst :

> On 2015/10/05 04:30, Asmus Freytag (t) wrote:
>
>> On 10/4/2015 6:02 AM, Richard Wordingham wrote:
>>
>>> In the absence of a specific tailoring, is the combination of a lone
>>> surrogate and a combining mark a user-perceived character?  Does a lone
>>> surrogate constitute a user-perceived character?
>>>
>>
>> In an editing interface, a lone surrogate should be a user perceived
>> character,
>> as otherwise you won't be able to manually delete it. Markus suggests
>> that it be
>> treated like an unassigned code point.
>>
>
> In an editing tool (of which an editing interface is a part of), a lone
> surrogate should just be removed! Apparently, that's what happens in
> Richard's case, but only eventually.
>
> Regards,   Martin.
>


Re: Deleting Lone Surrogates

2015-10-05 Thread Martin J. Dürst

On 2015/10/05 04:30, Asmus Freytag (t) wrote:

> On 10/4/2015 6:02 AM, Richard Wordingham wrote:
>
>> In the absence of a specific tailoring, is the combination of a lone
>> surrogate and a combining mark a user-perceived character?  Does a lone
>> surrogate constitute a user-perceived character?
>
> In an editing interface, a lone surrogate should be a user perceived character,
> as otherwise you won't be able to manually delete it. Markus suggests that it be
> treated like an unassigned code point.

In an editing tool (of which an editing interface is a part), a lone
surrogate should just be removed! Apparently, that's what happens in
Richard's case, but only eventually.

Regards,   Martin.


Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 16:57:15 -0700
"Asmus Freytag (t)"  wrote:

> On 10/4/2015 4:14 PM, Richard Wordingham wrote:
> respect to what to erase or undo.

>>> For sequences that belong to a given language, you can pick the
>>> behavior that makes most sense in them, but for lone surrogates, by
>>> definition you are dealing with broken text that doesn't follow any
>>> conventions.
 
>> Who's 'you'?  Customisation is frequently not available.  In fact, I
>> don't recall seeing it on offer.

> The UI developer.
 
> And there's nothing Unicode can do about lack of customizability.

Actually, there is.  I believe suggestions and recommendations in the
technical reports are quite influential.

Richard. 


Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 15:34:13 -0700
"Asmus Freytag (t)"  wrote:

> On 10/4/2015 2:35 PM, Richard Wordingham wrote:

>> I'd much prefer to be able to delete the first character of a
>> grapheme
>> cluster.  It's annoying to have to retype 4 characters because one's
>> mistyped the first of the 4 characters in a grapheme cluster.
>> Removing the restriction would be much more useful.

> That makes sense for common typos, less so, for uncommon (hopefully)
> data corruption.

Allowing access within the cluster is generally useful.  Providing more
access just makes it easier to repair things.  One problem is that
there isn't a 'suspend shaping' option to allow one to see what one is
doing.  This matters when canonical combining classes are not available
to sort out the ordering of components.

> For some languages, you'll be typing several keystrokes, even if it's
> a single code point; there seems to be limited desire to allow you to
> "edit" the keystrokes.

The creators of the application do not know how many keystrokes were
used.  A multi-platform application is not likely to take note of what
keys were pressed even when this information is available.

> For other languages I would expect a UI design
> to cater to what local custom prefers.

Local custom?

'Local custom' is usually one of the following:

a) pen and ink, possibly with scraper.

b) typewriter and tippex

c) Hacked ASCII (and similar)

Only with complex ligatures would you not have access to each
character.

The only parallels to what happens now that I can think of that might
count as 'custom' are:

1) European 8-bit codes, where letter plus diacritic is treated as a
unit.

2) Korean, where one couldn't chop and change the individual jamo.

3) Thai, where a tone mark can severely restrict what scraping can do.

A UI design might respond to loud enough howls of user protest.  You
may recall Thai howls of protest when the ability to independently
delete preposed vowels was lost.  Thai may have some complex vowel
symbols, but as far as the grapheme clusters go, *Thai* doesn't get more
complicated than CVT (consonant, vowel (just one!) and tone).  Some of
the minority languages in the Thai script might be a bit more
complicated.

I do recall SIL's split cursor, which attempted to address the
difficulties of navigating through a stack of diacritics.  I miss it,
even though I never got to grips with all its subtleties.

What I believe is much more the case is that Unicode encourages 'one
size fits all'.  There are massive *translation* efforts for user
interfaces.  As to other parts of the text input/output, they are
usually separate from the applications.  The keyboard is almost totally
independent of the application.  Fonts are restricted to attempts to
provide adequate coverage, but the ideal is that the user provides his
own.  I think the LibreOffice search and replace interface says a lot.
It has visible support for Japanese - they holler and may well add
their own support into the core project - and there are some CTL
options which make best sense from the point of view of the Arabic
script.  The limitations on editing are one of the few places where the
UI is under the tight control of the programmers.  By and large, they
seem to be influenced by a few sources, such as the Unicode technical
reports.

Refutation awaited.

Now an attitude of 'one size fits all' does get things done.  It might
be a bit rough, but it's a lot better than nothing.

Richard.


Re: Deleting Lone Surrogates

2015-10-04 Thread Asmus Freytag (t)

  
  
On 10/4/2015 4:14 PM, Richard Wordingham wrote:

>> respect to what to erase or undo.
>>
>> For sequences that belong to a given language, you can pick the
>> behavior that makes most sense in them, but for lone surrogates, by
>> definition you are dealing with broken text that doesn't follow any
>> conventions.
>
> Who's 'you'?  Customisation is frequently not available.  In fact, I
> don't recall seeing it on offer.

The UI developer.

And there's nothing Unicode can do about lack of customizability.

A./

  



Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 14:29:16 -0700
"Asmus Freytag (t)"  wrote:

> On 10/4/2015 12:38 PM, Richard Wordingham wrote:

> The problem you are trying to solve is to allow editing on
> the code point level, or, if you will, the keystroke level.

> Generally, there will be a sweet spot for each language (and each
> user) with respect to what to erase or undo.

> For sequences that belong to a given language, you can pick the
> behavior that makes most sense in them, but for lone surrogates, by
> definition you are dealing with broken text that doesn't follow any
> conventions.

Who's 'you'?  Customisation is frequently not available.  In fact, I
don't recall seeing it on offer.

> It should also be something that doesn't occur commonly. So, for all
> of those reasons, I see no particular problem with giving that a
> "generic" behavior, which could be that of deleting the entire
> combining sequence; especially if your interface normally deletes
> sequences as a unit.

> But in any case, the minimal requirement on an editor is that it lets
> you delete (and then retype) enough text to get it back to an
> uncorrupted state.

In the problem I hit, I would nearly be left with two options - never
having CANDRABINDU and always having it preceded by CANDRABINDU.
Whenever I enter CANDRABINDU, it is preceded by the lone surrogate.
Consequently, the option of retyping the sequence is of no avail.
Fortunately, in the application where I met the problem, the lone
surrogates, and nothing else, get deleted when the file is saved. The
problem could very easily be a lot worse.



> Catch-22 here. In filtering input to the dialog to prevent it from
> being used to corrupt text, you prevent it from being used to repair
> text. Interesting.

Not very different to having a very roll-stable aeroplane. If you ever
do end up upside-down, you have a big problem. 

Richard.


Re: Deleting Lone Surrogates

2015-10-04 Thread Asmus Freytag (t)

  
  
On 10/4/2015 2:35 PM, Richard Wordingham wrote:

>> However my opinion is that   𑒏�𑒺 (using U+FFFD substitution) gives 2
>> grapheme clusters, I would prefer a solution that gives 3 grapheme
>> clusters, as if the lone surrogate was a line-break control, so that
>> the third character (combining, but just after the lone surrogate)
>> will not combine with it but will be handled as a defective combining
>> sequence with no starter at all before it.
>
> I'd much prefer to be able to delete the first character of a grapheme
> cluster.  It's annoying to have to retype 4 characters because one's
> mistyped the first of the 4 characters in a grapheme cluster.  Removing
> the restriction would be much more useful.

That makes sense for common typos, less so for uncommon (hopefully) data
corruption.

For some languages, you'll be typing several keystrokes, even if it's a
single code point; there seems to be limited desire to allow you to "edit"
the keystrokes. For other languages I would expect a UI design to cater to
what local custom prefers.

A./

  



Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 21:48:12 +0200
Philippe Verdy  wrote:

> 2015-10-04 21:30 GMT+02:00 Richard Wordingham <
> richard.wording...@ntlworld.com>:

> > On Sun, 4 Oct 2015 15:44:32 +0200
> > Mark Davis ☕️  wrote:

> > > When I use http://unicode.org/cldr/utility/breaks.jsp, it does
> > > show the sequence 𑒏�𑒺 as just two grapheme clusters.

> > But that's the sequence , which has no
> > lone surrogates at all!

> Mark just said that it was what was shown, i.e. the lone surrogate got
> treated as U+FFFD.

That's not what the English says, and I'm surprised if that's what a
literal translation into French means.  I do half suspect that he
actually tried to post a lone surrogate.

> However my opinion is that   𑒏�𑒺 (using U+FFFD substitution) gives 2
> grapheme clusters, I would prefer a solution that gives 3 grapheme
> clusters, as if the lone surrogate was a line-break control, so that
> the third character (combining, but just after the lone surrogate)
> will not combine with it but will be handled as a defective combining
> sequence with no starter at all before it.

I'd much prefer to be able to delete the first character of a grapheme
cluster.  It's annoying to have to retype 4 characters because one's
mistyped the first of the 4 characters in a grapheme cluster.  Removing
the restriction would be much more useful.

Richard.



Re: Deleting Lone Surrogates

2015-10-04 Thread Asmus Freytag (t)

  
  
On 10/4/2015 12:38 PM, Richard Wordingham wrote:

> On Sun, 4 Oct 2015 10:50:43 -0700
> Markus Scherer  wrote:
>
>> I would not spend any time specifying intricate rules for unpaired
>> surrogates in 16-bit strings, or out-of range values in 32-bit
>> strings. Most processing will treat them like unassigned characters,
>> like U+50005, with only default behaviors.
>
> The core problem here is that many editors will not allow one to delete
> just a non-initial character from a grapheme cluster.  I fear there may
> be editors that don't even allow one to delete the final character.
> This may not be a problem when one works with a small set of grapheme
> clusters, as in French or German, or possibly even Vietnamese, but
> becomes a problem when working with such a large set that the notion of
> them being user-perceived characters strains credulity.

The problem you are trying to solve is to allow editing on the code
point level, or, if you will, the keystroke level. Generally, there
will be a sweet spot for each language (and each user) with respect
to what to erase or undo.

For sequences that belong to a given language, you can pick the
behavior that makes most sense in them, but for lone surrogates, by
definition you are dealing with broken text that doesn't follow any
conventions.

It should also be something that doesn't occur commonly. So, for all
of those reasons, I see no particular problem with giving that a
"generic" behavior, which could be that of deleting the entire
combining sequence; especially if your interface normally deletes
sequences as a unit.

If it never treats sequences as units, then I would in fact question
why this should be different for surrogates.

But in any case, the minimal requirement on an editor is that it
lets you delete (and then retype) enough text to get it back to an
uncorrupted state.

> A stray U+50005 before a combining mark would also be fiddly to get
> rid of, but even if the editor does not allow the entry of arbitrary
> scalar values, a user might fix the problem by creating an HTML file
> containing the character and then copying the character from the HTML
> file to a find and replace command.  This trick is unlikely to work for
> a lone surrogate.

Catch-22 here. In filtering input to the dialog to prevent it from
being used to corrupt text, you prevent it from being used to repair
text. Interesting.

A./
  



Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 12:30:23 -0700
"Asmus Freytag (t)"  wrote:

> If you have a bug that doesn't let you enter a sequence without
> creating a lone surrogate followed by a combining mark, that's a
> bug...

Unfortunately, the bug appears to be in an ill-defined interface in
which I have observed regression even within the BMP.  We've discussed
the ambiguity of 'delete one character' in the context of normalisation
before on this list, and the surest solution seemed to be for the
application to surrender some control of its 'backing store' to the
input method.

It's conceivable that the input methods that are compatible for the BMP
are incompatible in the supplementary planes. For now, I'm going to
have to either work round the problem by using dead keys instead or be
thankful that the application hasn't caught up with Unicode 7.0.

Richard.


Re: Deleting Lone Surrogates

2015-10-04 Thread Philippe Verdy
2015-10-04 21:30 GMT+02:00 Richard Wordingham <
richard.wording...@ntlworld.com>:

> On Sun, 4 Oct 2015 15:44:32 +0200
> Mark Davis ☕️  wrote:
>
> > When I use http://unicode.org/cldr/utility/breaks.jsp, it does show
> > the sequence 𑒏�𑒺 as just two grapheme clusters.
>
> But that's the sequence , which has no lone
> surrogates at all!  (I had to look at the raw email file to be sure of
> what the text was - my email client displays U+FFFD and malformed
> alleged UTF-8 the same.)

Mark just said that it was what was shown, i.e. the lone surrogate got
treated as U+FFFD.
However my opinion is that   𑒏�𑒺 (using U+FFFD substitution) gives 2
grapheme clusters, I would prefer a solution that gives 3 grapheme
clusters, as if the lone surrogate was a line-break control, so that the
third character (combining, but just after the lone surrogate) will not
combine with it but will be handled as a defective combining sequence with
no starter at all before it.


Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 10:50:43 -0700
Markus Scherer  wrote:

> I would not spend any time specifying intricate rules for unpaired
> surrogates in 16-bit strings, or out-of range values in 32-bit
> strings. Most processing will treat them like unassigned characters,
> like U+50005, with only default behaviors.

The core problem here is that many editors will not allow one to delete
just a non-initial character from a grapheme cluster.  I fear there may
be editors that don't even allow one to delete the final character.
This may not be a problem when one works with a small set of grapheme
clusters, as in French or German, or possibly even Vietnamese, but
becomes a problem when working with such a large set that the notion of
them being user-perceived characters strains credulity.

A stray U+50005 before a combining mark would also be fiddly to get
rid of, but even if the editor does not allow the entry of arbitrary
scalar values, a user might fix the problem by creating an HTML file
containing the character and then copying the character from the HTML
file to a find and replace command.  This trick is unlikely to work for
a lone surrogate.

Richard.


Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 15:44:32 +0200
Mark Davis ☕️  wrote:

> When I use http://unicode.org/cldr/utility/breaks.jsp, it does show
> the sequence 𑒏�𑒺 as just two grapheme clusters.

But that's the sequence <U+1148F, U+FFFD, U+114BA>, which has no lone
surrogates at all!  (I had to look at the raw email file to be sure of
what the text was - my email client displays U+FFFD and malformed
alleged UTF-8 the same.)  I believe I would have a good chance of
repairing that by replacing U+FFFD by nothing.

It's not even certain that the substitution to replace U+FFFD would
work. With a more fully supported script in LibreOffice, I would have to
switch 'CTL diacritic' matching off and hope that substitution replaced
the shortest match.  That currently works for replacing one Thai
consonant by another.  To systematically replace a non-spacing Thai
character by another, I have to resort to 'regular expression'
search and replace.  I must hope that they never choose to interpret
the search as matching extended grapheme clusters.

Do all Unicode character properties extend to all codepoints?  If not,
how does one tell which do and which don't?  If the Unicode
segmentation algorithms do apply to sequences of codepoints, as
opposed to merely to Unicode strings, then indeed <U+D805, U+114BA> is
a legacy grapheme cluster.  It's an extremely unhelpful one!

> In #29 we are specifically not concerned about ill-formed text (or
> other degenerate cases). I suppose it would be possible to handle
> isolated surrogates in different way (eg always breaking) if it
> represented a common problem, but someone would have to make a very
> good case for that.

I suppose the argument will go that by using rare scripts or obsolete
characters, one deserves all the problems that one gets.  The only
widely used script where one is likely to encounter lone surrogates is
CJK, and they are less of a problem there.  Ideally, one shouldn't get
isolated surrogates, but when one does, the mechanisms intended to
prevent them occurring can make dealing with them difficult.

Richard.



Re: Deleting Lone Surrogates

2015-10-04 Thread Asmus Freytag (t)

  
  
On 10/4/2015 6:02 AM, Richard Wordingham wrote:

> In the absence of a specific tailoring, is the combination of a lone
> surrogate and a combining mark a user-perceived character?  Does a lone
> surrogate constitute a user-perceived character?

In an editing interface, a lone surrogate should be a user perceived
character, as otherwise you won't be able to manually delete it. Markus
suggests that it be treated like an unassigned code point.

Now, if you follow an unassigned code point with a combining mark,
what should you get?

For scripts where combining marks are productive, it seems
counter-productive (pardon the pun) to go and limit this process,
only to have to update your software every year as a new version
of Unicode comes out.

(Astute readers will notice that combining marks don't necessarily
have scripts, nor do unassigned code points, so I'm talking about
those marks that are used productively with certain scripts and
particularly those that can be applied widely out of context for
technical purposes)

So, if you allow a generalized algorithm that gloms these marks
onto any base, even unassigned code points, then it would be
natural to have this happen to lone surrogates as well, meaning
that the surrogate cannot be fixed in isolation.  That's tough.
There are plenty of interfaces where you can't change a base
character in isolation.

If you have a bug that doesn't let you enter a sequence without
creating a lone surrogate followed by a combining mark, that's a
bug...

A./

  



Re: Deleting Lone Surrogates

2015-10-04 Thread Philippe Verdy
The default behavior for unassigned characters is to treat them like base
characters, so if they are followed by a combining mark, it would create a
default grapheme cluster, which is not appropriate here.

Surrogates are not characters (so they cannot have any character
properties), but they are assigned and so don't have "default" properties
(only meant for *unassigned* codepoints).

I still think that it is safer to treat them, for text segmentation
purposes, as pure isolates, i.e. exactly like basic controls such as
U+0000 NUL, or like the U+FFFD replacement character, which is typically
used as a visible placeholder for various errors.

For normalisation purposes they should also have combining class 0 (i.e.
acting as blockers against reorderings for canonical equivalences), and not
be treated as "transparent" (discarded and bypassed as if those surrogates
were not present at all).
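
The "blocker" role of combining class 0 can be observed with Python's standard `unicodedata.normalize`. The sketch below assumes that the interpreter accepts a lone surrogate in a `str` passed to `normalize` and gives it combining class 0, which is how CPython appears to behave; it illustrates canonical reordering only and is not a statement about what the standard requires for surrogates.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220

# Adjacent marks with different non-zero classes get reordered (220 < 230).
plain = "a" + ACUTE + DOT_BELOW
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", plain)])

# A class-0 code point between them blocks the reordering. Here the
# blocker is a lone surrogate, assumed to be treated as class 0.
blocked = "a" + ACUTE + "\ud800" + DOT_BELOW
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", blocked)])
```

In the first string the dot below moves in front of the acute accent; in the second the order is left untouched because the surrogate sits between the two marks.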

2015-10-04 19:50 GMT+02:00 Markus Scherer :

> I would not spend any time specifying intricate rules for unpaired
> surrogates in 16-bit strings, or out-of range values in 32-bit strings.
> Most processing will treat them like unassigned characters, like U+50005,
> with only default behaviors.
> markus
>


Re: Deleting Lone Surrogates

2015-10-04 Thread Markus Scherer
I would not spend any time specifying intricate rules for unpaired
surrogates in 16-bit strings, or out-of range values in 32-bit strings.
Most processing will treat them like unassigned characters, like U+50005,
with only default behaviors.
markus


Re: Deleting Lone Surrogates

2015-10-04 Thread Philippe Verdy
IMHO, isolated surrogates are not valid starters for combining sequences;
they must remain isolated: deleting this surrogate in your text editor
should not delete the following combining mark, which is a separate cluster
(even if that cluster is defective before the deletion, as it has NO base
starter).
For default grapheme clusters, it would be helpful to add a rule to force a
cluster break before and after any lone surrogate (i.e. for grapheme cluster
breaking, treat any lone surrogate as if it were a control like NUL U+0000).

2015-10-04 15:02 GMT+02:00 Richard Wordingham <
richard.wording...@ntlworld.com>:

> In the absence of a specific tailoring, is the combination of a lone
> surrogate and a combining mark a user-perceived character?  Does a lone
> surrogate constitute a user-perceived character?
>
> The problem I have is that because of an application-specific bug,
> when I attempt to enter the sequence <U+1148F TIRHUTA LETTER KA,
> U+114BA TIRHUTA SIGN CANDRABINDU>, I appear to be getting the UTF-16 code
> unit sequence <D805 DC8F D805 D805 DCBA>, which is being interpreted as
> the codepoint sequence <U+1148F, U+D805, U+114BA>.
>
> (The problem seems to arise because I use a sequence of two key strokes
> to enter candrabindu, and the application or input mechanism has to undo
> the entry of a supplementary character entered in response to the first
> keystroke.  I've reported the problem as Bug 94753.)
>
> Because the lone surrogate is interpreted as the start of a
> user-perceived character, I can move the cursor to between U+1148F and
> U+D805.  Then pressing the 'delete' key (as opposed to the 'rubout'
> key) will delete the U+D805.  However, if the lone surrogate plus
> combining mark is a user-perceived character, then all I will be left
> with is <U+1148F>.  At present the offending application is treating
> Tirhuta combining marks as user-perceived characters, but I suspect the
> application has simply not caught up with Unicode Version 7 yet.
>
> Richard.
>


Re: Deleting Lone Surrogates

2015-10-04 Thread Mark Davis ☕️
When I use http://unicode.org/cldr/utility/breaks.jsp, it does show the
sequence 𑒏�𑒺 as just two grapheme clusters.

In #29 we are specifically not concerned about ill-formed text (or other
degenerate cases). I suppose it would be possible to handle isolated
surrogates in different way (eg always breaking) if it represented a common
problem, but someone would have to make a very good case for that.


Mark 

*— Il meglio è l’inimico del bene —*

On Sun, Oct 4, 2015 at 3:02 PM, Richard Wordingham <
richard.wording...@ntlworld.com> wrote:

> In the absence of a specific tailoring, is the combination of a lone
> surrogate and a combining mark a user-perceived character?  Does a lone
> surrogate constitute a user-perceived character?
>
> The problem I have is that because of an application-specific bug,
> when I attempt to enter the sequence <U+1148F TIRHUTA LETTER KA,
> U+114BA TIRHUTA SIGN CANDRABINDU>, I appear to be getting the UTF-16 code
> unit sequence <D805 DC8F D805 D805 DCBA>, which is being interpreted as
> the codepoint sequence <U+1148F, U+D805, U+114BA>.
>
> (The problem seems to arise because I use a sequence of two key strokes
> to enter candrabindu, and the application or input mechanism has to undo
> the entry of a supplementary character entered in response to the first
> keystroke.  I've reported the problem as Bug 94753.)
>
> Because the lone surrogate is interpreted as the start of a
> user-perceived character, I can move the cursor to between U+1148F and
> U+D805.  Then pressing the 'delete' key (as opposed to the 'rubout'
> key) will delete the U+D805.  However, if the lone surrogate plus
> combining mark is a user-perceived character, then all I will be left
> with is <U+1148F>.  At present the offending application is treating
> Tirhuta combining marks as user-perceived characters, but I suspect the
> application has simply not caught up with Unicode Version 7 yet.
>
> Richard.
>