Re: Question about Perl5 extended UTF-8 design

2015-11-06 Thread Karl Williamson

On 11/06/2015 01:32 PM, Richard Wordingham wrote:

On Thu, 05 Nov 2015 13:41:42 -0700
"Doug Ewell"  wrote:


Richard Wordingham wrote:


No-one's claiming it is for a Unicode Transformation Format (UTF).


Then they ought not to call it "UTF-8" or "extended" or "modified"
UTF-8, or anything of the sort, even if the bit-shifting algorithm is
based on UTF-8.



"UTF-8 encoding form" is defined as a mapping of Unicode scalar values
-- not arbitrary integers -- onto byte sequences. [D92]


If it extends the mapping of Unicode scalar values *into* byte
sequences, then it's an extension.  A non-trivial extension of a
mapping of scalar values has to have a larger domain.

I'm assuming that 'UTF-8' and 'UTF' are not registered trademarks.

Richard.



I have no idea how my original message ended up being marked to send to 
this list.  I'm sorry.  It was meant to be a personal message for 
someone who I believe was involved in the original design.


Re: Question about Perl5 extended UTF-8 design

2015-11-06 Thread Richard Wordingham
On Thu, 05 Nov 2015 13:41:42 -0700
"Doug Ewell"  wrote:

> Richard Wordingham wrote:
> 
> > No-one's claiming it is for a Unicode Transformation Format (UTF).
> 
> Then they ought not to call it "UTF-8" or "extended" or "modified"
> UTF-8, or anything of the sort, even if the bit-shifting algorithm is
> based on UTF-8.

> "UTF-8 encoding form" is defined as a mapping of Unicode scalar values
> -- not arbitrary integers -- onto byte sequences. [D92]

If it extends the mapping of Unicode scalar values *into* byte
sequences, then it's an extension.  A non-trivial extension of a
mapping of scalar values has to have a larger domain.

I'm assuming that 'UTF-8' and 'UTF' are not registered trademarks.

Richard.


Re: Question about Perl5 extended UTF-8 design

2015-11-06 Thread Otto Stolz

Am 05.11.2015 um 23:11 schrieb Ilya Zakharevich:

First of all, “reserved” means that they have no meaning.  Right?


Almost.

“Reserved” means that they currently have no meaning
but may be assigned a meaning later; hence you ought
not to use them, lest your programs, or data, be invalidated
by later amendments of the pertinent specification.

In contrast, “invalid”, or “ill-formed” (the Unicode term),
means that the particular bit pattern may never be used
in a sequence that purports to represent Unicode characters.
In practice, that means that no program is allowed to
send those ill-formed patterns in Unicode-based data exchange,
and every program should refuse to accept those ill-formed
patterns in Unicode-based data exchange.

What a program does internally is at the discretion (or should
I say: “whim”?) of its author, of course – as long as the
overall effect of the program complies with the standard.
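The reserved-vs-ill-formed distinction shows up directly in conforming decoders; a minimal Python illustration (mine, not part of the original message), using the byte 0xFF as the ill-formed pattern:

```python
# 0xFF is ill-formed in UTF-8: a conforming decoder must refuse it
# rather than silently assign it a meaning.
def strict_decode(data: bytes) -> str:
    return data.decode("utf-8")  # raises UnicodeDecodeError on ill-formed input

try:
    strict_decode(b"\xFFabc")
    rejected = False
except UnicodeDecodeError:
    rejected = True

assert rejected                        # ill-formed pattern refused
assert strict_decode(b"abc") == "abc"  # well-formed input accepted
```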

Best wishes,
  Otto Stolz







Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Philippe Verdy
2015-11-05 23:11 GMT+01:00 Ilya Zakharevich  wrote
>
>   • 128-bit architectures may be at hand (sooner or later).

This is speculation about something that is still not envisioned: a global
worldwide working space where users and applications would interoperate
transparently in a giant virtualized environment. However, this virtualized
environment will be supported by 64-bit OSes that will never need native
support for more than 64-bit pointers. The 128-bit entities needed for
addressing will not be used to work on units of data, but to address some
small selection of remote entities.

Software that required parsing complete chunks of memory data larger than
64 bits would be extremely inefficient; instead, this data will be
internally structured/paged, and only virtually mapped to some 128-bit
global reference (such as GUIDs/UUIDs), used only to select smaller chunks
within the structure. In most cases those chunks will remain in a 32-bit
space: even in today's 64-bit OSes, the largest pages are 20 bits wide, but
typically 10 bits wide (512-byte sectors) to 12 bits wide (standard VMM and
I/O page sizes, networking MTUs), or about 16 bits wide (such as the
transmission window for TCP). This will not evolve significantly before a
major evolution in the worldwide Internet backbones requires more than
about 1 Gbit/s (a speed not even needed for 4K HD video, but needed only
in massive computing grids, still built with a complex mesh of much slower
data links).

With 64-bit we already reach the physical limits of networking links, and
higher speeds using large buses are only for extremely local links whose
lengths are well below a few millimetres, within the chips themselves.

128 bits, however, is possible, though not for the working spaces (or
document sizes): it is very unlikely that the ANSI C/C++ "size_t" type
will ever be more than 64 bits (except for a few experiments which will
fail to be more efficient).

What is more realistic is that internal buses and caches will be 128 bits
or even larger (this is already true for GPU memory), only to support more
parallelism or massive parallelism (typically by using vectored
instructions working on sets of smaller values).

And some data needs 128-bit values for its numerical range (ALUs in
CPUs/GPUs/APUs are already 128-bit, as are common floating-point types)
where extra precision is necessary.

I doubt we'll ever see any true native 128-bit architecture within our
remaining lifetimes. We are still very far from the limits of the 64-bit
architecture, and it won't happen before the next century (if the current
sequential binary model of computing is still used at that time; maybe
computing will use predictive technologies returning only heuristic results
with a very high probability of giving a good solution to the problems
we'll need to solve extremely rapidly, and those solutions will then be
validated using today's binary logic with 64-bit computing).

Even if a global 128-bit networking space appeared, users would never be
exposed to all of it: most of this content will be inaccessible to them
(restricted by security or privacy concerns) and simply unmanageable by
them. No one on earth can have any idea of what 2^64 bits of global data
represents; no one will ever need it in their whole life. That amount of
data will only be partly implemented by large organisations trying to
build a giant cloud and wishing to interoperate by coordinating their
addressing spaces (for that we now have IPv6).

So your "sooner or later" is very optimistic.

IMHO we'll stay with 64-bit architectures for a very long time, up to the
point where our sequential computing model is deprecated and the concept of
native integer sizes is obsoleted and replaced by other kinds of computing
"units" (notably parallel vectors, distributed computing, and heuristic
computing, or maybe optical computing based on Fourier transforms of analog
signals, or quantum computing, where our simple notions of "integers" or
even "bits" will not even be placeable into individual physically placed
units; their persistence will not even be localized, and there will be
redundant/fault-tolerant placements).

In fact our computing limits will no longer be in terms of storage space,
but in terms of access time, distance, and predictability of results.

The next technologies for faster computing will certainly be
predictive/probabilistic rather than affirmative (as with today's
Turing/Von Neumann machines). "Algorithms" for working with them will be
completely different. Fuzzy logic will be everywhere, and we'll need binary
logic even less, except for small problems. We'll have to live with the
possibility of errors, but we already have to live with them even with our
binary logic (due to human bugs, hardware faults, accidents, and so on).
In most problems we don't even need to have 100% proven solutions

(e.g. when viewing a high-quality video, we already accept the possibility
of some "quirks" occurring).

Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Ilya Zakharevich
On Thu, Nov 05, 2015 at 08:57:16AM -0700, Karl Williamson wrote:
> Several of us are wondering about the reason for reserving bits for
> the extended UTF-8 in perl5.  I'm asking you because you are the
> apparent author of the commits that did this.

To start, the INTERNAL REPRESENTATION of Perl’s strings is the «utf8»
format (not «UTF-8», «extended» or not).  [I see that this misprint
caused a lot of stir here!]

However, outside of a few contexts, this internal representation
should not be visible.  (However, some of these contexts are close to
being the default, like read/write in Unicode mode with the -C switch.)

Perl’s string is just a sequence of Perl’s unsigned integers.
[Depending on the build, this may be, currently, 32-bit or 64-bit.]
By convention, the “meaning” of small integers coincides with what
Unicode says.

> To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes
> the length of the sequence of bytes that comprise a single character
> to be 13 bytes.  This allows code points up to 2**72 - 1 to be
> represented. If the length had been instead 12 bytes, code points up
> to 2**66 - 1 could be represented, which is enough to represent any
> code point possible in a 64-bit word.
> 
> The comments indicate that these extra bits are "reserved".  So
> we're wondering what potential use you had thought of for these
> bits.

First of all, “reserved” means that they have no meaning.  Right?

Second, there are 2 ways in which one may need this INTERNAL format to
be extended:
  • 128-bit architectures may be at hand (sooner or later).
  • One may need to allow “objects” to be embedded into Perl strings.

With embedded objects, one must know how to kill them when the string
(or part of it) is removed.  So, while a pointer can fit into a Perl
integer, one needs to specify what to do: call DESTROY, or free(), or
a user-defined function.

This gives 5 possibilities (3 extra bits) which may be needed for
“slots” in Perl strings.
  • Integer (≤64 bits)
  • Integer (≥65 bits) 
  • Pointer to a Perl object
  • Pointer to a malloc()ed memory
  • Pointer to a struct which knows how to destroy itself.
  struct self_destroy { void *content; void (*destroy)(struct self_destroy *); };
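The slot scheme above can be sketched in Python. This is a hypothetical illustration of the idea, not Perl's actual internals; the names (`PERL_OBJ`, `destroy_slot`, etc.) are mine:

```python
# A string as a sequence of tagged slots: a few tag bits distinguish
# plain code points from embedded objects, each needing different
# cleanup when its part of the string is removed.
INT, BIGINT, PERL_OBJ, MALLOCED, SELF_DESTROY = range(5)  # 5 kinds -> 3 tag bits

def destroy_slot(slot, free=None):
    """Release whatever a slot holds; plain integers need no cleanup."""
    tag, value = slot
    if tag == PERL_OBJ:
        value.DESTROY()                       # call the object's destructor
    elif tag == MALLOCED:
        (free or (lambda p: None))(value)     # hand the pointer back to free()
    elif tag == SELF_DESTROY:
        value["destroy"](value["content"])    # struct that knows how to die

log = []
class Embedded:
    def DESTROY(self):
        log.append("destroyed")

destroy_slot((INT, 0x41))             # plain code point: nothing to do
destroy_slot((PERL_OBJ, Embedded()))  # object slot: DESTROY is invoked
```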

Why might one need objects embedded into strings?  I explained it in
   http://ilyaz.org/interview
(look for «Emacs» near the middle).

Hope this helps,
Ilya


Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Doug Ewell
Richard Wordingham wrote:

> No-one's claiming it is for a Unicode Transformation Format (UTF).

Then they ought not to call it "UTF-8" or "extended" or "modified"
UTF-8, or anything of the sort, even if the bit-shifting algorithm is
based on UTF-8.

"UTF-8 encoding form" is defined as a mapping of Unicode scalar values
-- not arbitrary integers -- onto byte sequences. [D92]

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸




Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Richard Wordingham
On Thu, 5 Nov 2015 18:25:05 +0100
Philippe Verdy  wrote:

> But these extra code points could be used to represent something else,
> such as unique object identifiers for internal use in your
> application, or virtual object pointers, or shared memory block
> handles, file/pipe/stream I/O handles, service/API handles, user ids,
> security tokens, 64-bit content hashes plus some binary flags,
> placeholders/references for members in an external unencoded
> collection or for URIs, or internal glyph ids when converting text
> for rendering with one or more fonts, or some internal serialization
> of geometric shapes/colors/styles/visual effects...

No-one's claiming it is for a Unicode Transformation Format (UTF).  A
possibly relevant example of a something else is a non-precomposed
grapheme cluster, as in Perl6's NFG.  (This isn't a PUA encoding, as
the precomposed characters are created on the fly.)

Richard.


Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Markus Scherer
On Thu, Nov 5, 2015 at 9:25 AM, Philippe Verdy  wrote:

> (0xFF was reserved only in the old RFC version of UTF-8 when it allowed
> code points up to 31 bits, but even this RFC is obsolete and should no
> longer be used and it has never been approved by Unicode).
>

No, even in the original UTF-8 definition, "The octet values FE and FF
never appear." https://tools.ietf.org/html/rfc2279
The highest lead byte was 0xFD.

(For the "really original" version see
http://www.unicode.org/L2/Historical/wg20-n193-fss-utf.pdf)

In the current definition, "The octet values C0, C1, F5 to FF never
appear." https://tools.ietf.org/html/rfc3629 =
https://tools.ietf.org/html/std63
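The byte ranges Markus cites can be checked mechanically; a small Python summary (mine, derived from the two RFCs quoted above):

```python
# Lead bytes of multi-byte sequences under the obsolete RFC 2279
# (2- to 6-byte forms: C0-DF, E0-EF, F0-F7, F8-FB, FC-FD), versus the
# byte values excluded entirely by the current RFC 3629 / STD 63.
rfc2279_lead_bytes = set(range(0xC0, 0xFE))
rfc3629_never = {0xC0, 0xC1} | set(range(0xF5, 0x100))

assert max(rfc2279_lead_bytes) == 0xFD   # highest lead byte was 0xFD
assert 0xFE not in rfc2279_lead_bytes    # "FE and FF never appear"
assert 0xFF not in rfc2279_lead_bytes
assert 0xFF in rfc3629_never             # still excluded today
```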

markus


Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Philippe Verdy
It won't represent any valid Unicode code point (no standard scalar value
defined), so if you use those leading bytes, don't pretend it is "UTF-8"
(not even "modified UTF-8", which is the variant created in Java for its
internal serialization of unrestricted 16-bit strings, including lone
surrogates, and modified also in its representation of U+0000 as
<0xC0,0x80> instead of <0x00> in standard UTF-8). You'll have to create
your own charset identifier (e.g. "perl5-UTF-8-extended" or some name
derived from your Perl5 library) and say it is not for use in interchange
of standard text.
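The Java quirk mentioned above is easy to demonstrate; a hedged Python sketch of just the U+0000 case (lone-surrogate handling omitted, and `encode_modified_utf8` is an illustrative name, not a real Java or Python API):

```python
# Java's "modified UTF-8" encodes U+0000 as the overlong pair 0xC0 0x80
# so that encoded strings never contain an embedded NUL byte; standard
# UTF-8 uses the single byte 0x00.
def encode_modified_utf8(s: str) -> bytes:
    out = bytearray()
    for ch in s:
        if ch == "\x00":
            out += b"\xC0\x80"         # overlong form, ill-formed in standard UTF-8
        else:
            out += ch.encode("utf-8")  # lone-surrogate handling omitted here
    return bytes(out)

assert encode_modified_utf8("a\x00b") == b"a\xC0\x80b"
assert "a\x00b".encode("utf-8") == b"a\x00b"   # standard UTF-8 keeps the NUL
```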

The extra code points you'll get are then necessarily for private use (but
still not part of the standard PUA set), and have absolutely no defined
properties from the standard. They should not be used to represent any
Unicode character or character sequence. In any API taking some text input,
those code points will never be decoded and will behave on input like
encoding errors.

But these extra code points could be used to represent something else, such
as unique object identifiers for internal use in your application, or
virtual object pointers, or shared memory block handles,
file/pipe/stream I/O handles, service/API handles, user ids, security
tokens, 64-bit content hashes plus some binary flags,
placeholders/references for members in an external unencoded collection or
for URIs, or internal glyph ids when converting text for rendering with one
or more fonts, or some internal serialization of geometric
shapes/colors/styles/visual effects...

In standard UTF-8 those extra byte values are not "reserved" but
permanently assigned as "invalid", and there are no valid encoded
sequences as long as 12 or 13 bytes. (0xFF was reserved only in the old
RFC version of UTF-8, when it allowed code points up to 31 bits; but even
that RFC is obsolete, should no longer be used, and was never approved by
Unicode.)


2015-11-05 16:57 GMT+01:00 Karl Williamson :

> Hi,
>
> Several of us are wondering about the reason for reserving bits for the
> extended UTF-8 in perl5.  I'm asking you because you are the apparent
> author of the commits that did this.
>
> To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes the
> length of the sequence of bytes that comprise a single character to be 13
> bytes.  This allows code points up to 2**72 - 1 to be represented. If the
> length had been instead 12 bytes, code points up to 2**66 - 1 could be
> represented, which is enough to represent any code point possible in a
> 64-bit word.
>
> The comments indicate that these extra bits are "reserved".  So we're
> wondering what potential use you had thought of for these bits.
>
> Thanks
>
> Karl Williamson
>


Question about Perl5 extended UTF-8 design

2015-11-05 Thread Karl Williamson

Hi,

Several of us are wondering about the reason for reserving bits for the 
extended UTF-8 in perl5.  I'm asking you because you are the apparent 
author of the commits that did this.


To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes the 
length of the sequence of bytes that comprise a single character to be 
13 bytes.  This allows code points up to 2**72 - 1 to be represented. 
If the length had been instead 12 bytes, code points up to 2**66 - 1 
could be represented, which is enough to represent any code point 
possible in a 64-bit word.
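The arithmetic above follows from each continuation byte carrying 6 payload bits, with the 0xFF start byte contributing none (my summary of the scheme described in this paragraph):

```python
# An N-byte sequence led by 0xFF has N-1 continuation bytes of 6 payload
# bits each, so it can represent code points up to 2**(6*(N-1)) - 1.
def max_code_point(total_bytes: int) -> int:
    return 2 ** (6 * (total_bytes - 1)) - 1

assert max_code_point(13) == 2**72 - 1   # Perl's actual 13-byte limit
assert max_code_point(12) == 2**66 - 1   # the hypothetical 12-byte limit
assert max_code_point(12) >= 2**64 - 1   # already covers any 64-bit value
```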


The comments indicate that these extra bits are "reserved".  So we're 
wondering what potential use you had thought of for these bits.


Thanks

Karl Williamson