"Kenneth Whistler" > > Currently, if the Unicode scalar value (or invalid code unit) is NNNN > > (unsigned 32-bit value), then they are treated as expansions to > > ignorable collation elements: > > [.0000.0000.0000.NNNN] > > That statement is incorrect. The UCA currently specifies that > ill-formed code unit sequences and *noncharacters* are mapped > to [.0000.0000.0000.], but unassigned code points are not.
> > If we want to be smarter, we should not treat ALL the cases above
> > as fully ignorable at the first three levels, and should get
> > primary weights notably:
>
> Hmmm, if we want to be smarter, we should read what the actual
> specification says. That's what I did.

If there's a contradiction for you, that's because the specification
is ambiguous on these points. I've read and re-read it many times
before concluding that this was NOT fully specified (and thus
permitted under my interpretation).

> > so that they get primary weights lower than those used for
> > assigned characters in the same block, but still higher than
> > encoded characters from other blocks. Gaps should be provided in
> > the DUCET at the beginning of the ranges for these blocks so that
> > they can all fit in them. The benefit being also that other blocks
> > after them will keep their collation elements stable and won't be
> > affected by new allocations in one block.
>
> That particular way of assigning implicit weights for unassigned
> characters would be a complete mess to implement for the default
> table.

Yes, I admit that it would create huge gaps everywhere, but it's not
so critical for sinograms, which are encoded in a very compact way,
with NO gap at all, given that they are assigned primary weights
algorithmically from their scalar value. So mapping sinograms using
the same scheme, even if they are not yet encoded but at least fall
within the assigned blocks or planes, will make NO difference in the
DUCET.
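For reference, that algorithmic assignment is the implicit-weight rule
of UTS #10, which derives two collation elements from the code point
alone. A sketch (the base values are those given in UTS #10; the two
Han block tests are hypothetical helpers):

    # Sketch of the UTS #10 implicit-weight computation: consecutive
    # code points get consecutive primaries, hence no gaps for
    # sinograms. is_core_han / is_ext_han are hypothetical block tests.
    def implicit_elements(cp, is_core_han, is_ext_han):
        if is_core_han(cp):    # CJK Unified Ideographs
            base = 0xFB40
        elif is_ext_han(cp):   # CJK extensions
            base = 0xFB80
        else:                  # anything else, e.g. unassigned
            base = 0xFBC0
        return [(base + (cp >> 15), 0x0020, 0x0002),
                ((cp & 0x7FFF) | 0x8000, 0x0000, 0x0000)]

Extending that same computation to the not-yet-encoded positions of
the Han blocks and planes therefore changes nothing in the table
itself.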
> A. It would substantially increase the size of the default table
> for *all* users, because it would assign primary weights for
> all unassigned code points inside blocks -- code points which
> now simply get implicit weights assigned by rule.

Yes, I admit it.

> B. The assumptions about better default behavior are erroneous,
> because they presuppose things which are not necessarily true. In
> particular, the main goal appears to be to assure well-behavedness
> for future additions on a per-script basis, since primary weight
> orders are relevant to scripts. However, several of the most
> important scripts are now, for historical reasons, encoded in
> multiple blocks. A rule which assigns default primary weights on a
> per-block basis for unassigned characters would serve no valid
> purpose in such cases.

You can perfectly well exclude the positions that have been left
unassigned in blocks only for compatibility reasons. We should know
which they are (and in fact Unicode should then list them as
permanently invalid characters). If it does not, it's because Unicode
and ISO 10646 are still keeping open the possibility of encoding new
characters there, but this should only be for the relevant scripts to
which these positions were left unallocated.

> C. In addition to randomizing primary weight assignments for
> scripts in the case of multiple-block scripts, such a rule would
> also introduce *more* unpredictability in cases of the punctuation
> and symbols which are scattered around among many, many blocks,
> as well.

No, it would not: by default, and as long as they are not encoded,
they will sort within the script to which these blocks were
allocated. You can perfectly well list all the relevant blocks that
should be assigned weights together.

> In general this proposal fails to understand that the default
> weights for DUCET (as expressed in allkeys.txt) have absolutely
> nothing whatsoever to do with block identities or block ranges in
> the standard. The weighting algorithm knows absolutely nothing
> about block values.

Really? Yes, it depends more or less on the general category, but
most additions in the existing blocks are for letters. Given that
they sort after all scripts, they already have an "unordered"
position in collation. When they are eventually encoded, they will
have to move to their final position anyway. This proposal does not
suppress that possibility.

> > The other categories above (for code units exceeding the range of
> > valid scalar values if they are not treated as errors, or for
> > code points with valid scalar values and assigned to
> > noncharacters if they are not treated as errors, or for code
> > points with valid scalar values assigned or reserved in the
> > special supplementary plane) can be kept as fully ignorable,
> > using null weights on the (fully ignorable) first three levels,
> > and the implicit (last level) weights for scalar value or code
> > unit binary weights.
>
> Except that such treatment is not optimal for the noncharacters.
> As noted in the review note in the proposed update for UTS #10,
> noncharacters should probably be given implicit weights, rather
> than being treated as ignorables by default. That is a proposed
> change to the specification.

Yes, I agree with you on this change of rule: noncharacters are
permanently assigned, they have a meaning if they are ever used (even
if that would create ill-formed sequences, not representable in the
standard UTFs and not interchangeable), and they can't be ignored
silently. The best that can be done is to sort them all at the end,
with trailing weights, instead of skipping them silently, for easier
identification. But anyway, given that such use would be purely
local, applications are still free to handle them the way they want,
according to their own local private conventions.

The only other option that would be "portable" and interesting is if
one wanted to collate texts according to UTF-16 code unit values,
instead of Unicode scalar values, for the last binary weights
appended to sort keys for semi-stability. This would be done for
interoperability with legacy systems that still do not support the
supplementary planes (i.e. the deprecated implementation Level 1 of
ISO 10646), so that supplementary characters really encoded with
surrogate pairs (especially the supplementary sinograms, but also all
the newly encoded historical scripts) would not be fully ignored.
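Both of these options (trailing weights for noncharacters, and a
UTF-16 code-unit variant of the final level) are easy to express at
sort key time. A hypothetical sketch; the weight values in
`trailing_elements` are illustrative only, not taken from the DUCET:

    # Noncharacters: U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF on
    # every plane.
    def is_noncharacter(cp):
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    # Hypothetical trailing weights: a primary above the whole DUCET
    # range, then the code point itself so noncharacters stay ordered
    # among themselves (a real table would split cp into 16-bit
    # weights).
    def trailing_elements(cp):
        return [(0xFFFD, 0x0020, 0x0002), (cp, 0x0000, 0x0000)]

    # Final tie-breaking level: scalar values by default, or UTF-16
    # code units for legacy systems without supplementary-plane
    # support. In code-unit order, supplementary characters sort
    # between U+D7FF and U+E000 rather than after U+FFFF.
    def final_level(text, utf16_order=False):
        out = []
        for cp in map(ord, text):
            if not utf16_order or cp < 0x10000:
                out.append(cp)
            else:
                v = cp - 0x10000
                out.append(0xD800 + (v >> 10))    # high surrogate
                out.append(0xDC00 + (v & 0x3FF))  # low surrogate
        return out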