Re: Question about Perl5 extended UTF-8 design
On 11/06/2015 01:32 PM, Richard Wordingham wrote:
> On Thu, 05 Nov 2015 13:41:42 -0700 "Doug Ewell" wrote:
>> Richard Wordingham wrote:
>>> No-one's claiming it is for a Unicode Transformation Format (UTF).
>>
>> Then they ought not to call it "UTF-8" or "extended" or "modified"
>> UTF-8, or anything of the sort, even if the bit-shifting algorithm is
>> based on UTF-8.
>> "UTF-8 encoding form" is defined as a mapping of Unicode scalar values
>> -- not arbitrary integers -- onto byte sequences. [D92]
>
> If it extends the mapping of Unicode scalar values *into* byte
> sequences, then it's an extension. A non-trivial extension of a mapping
> of scalar values has to have a larger domain.
>
> I'm assuming that 'UTF-8' and 'UTF' are not registered trademarks.
>
> Richard.

I have no idea how my original message ended up being marked to send to this list. I'm sorry. It was meant to be a personal message for someone who I believe was involved in the original design.
Re: Question about Perl5 extended UTF-8 design
On Thu, 05 Nov 2015 13:41:42 -0700 "Doug Ewell" wrote:
> Richard Wordingham wrote:
>
> > No-one's claiming it is for a Unicode Transformation Format (UTF).
>
> Then they ought not to call it "UTF-8" or "extended" or "modified"
> UTF-8, or anything of the sort, even if the bit-shifting algorithm is
> based on UTF-8.
> "UTF-8 encoding form" is defined as a mapping of Unicode scalar values
> -- not arbitrary integers -- onto byte sequences. [D92]

If it extends the mapping of Unicode scalar values *into* byte sequences, then it's an extension. A non-trivial extension of a mapping of scalar values has to have a larger domain.

I'm assuming that 'UTF-8' and 'UTF' are not registered trademarks.

Richard.
Re: Question about Perl5 extended UTF-8 design
On 05.11.2015 at 23:11, Ilya Zakharevich wrote:
> First of all, “reserved” means that they have no meaning. Right?

Almost. “Reserved” means that they currently have no meaning but may be assigned a meaning later; hence you ought not to use them, lest your programs or data be invalidated by later amendments of the pertinent specification.

In contrast, “invalid”, or “ill-formed” (the Unicode term), means that the particular bit pattern may never be used in a sequence that purports to represent Unicode characters. In practice, that means that no program is allowed to send those ill-formed patterns in Unicode-based data exchange, and every program should refuse to accept them in Unicode-based data exchange.

What a program does internally is at the discretion (or should I say: “whim”?) of its author, of course – as long as the overall effect of the program complies with the standard.

Best wishes,
Otto Stolz
Re: Question about Perl5 extended UTF-8 design
2015-11-05 23:11 GMT+01:00 Ilya Zakharevich wrote:
> • 128-bit architectures may be at hand (sooner or later).

This is speculation about something that is still not envisioned: a global worldwide working space where users and applications would interoperate transparently in a giant virtualized environment. However, this virtualized environment will be supported by 64-bit OSes that will never need native support for pointers wider than 64 bits. The 128-bit entities needed for addressing will not be used to work on units of data but to address some small selection of remote entities.

Software that required completely parsing chunks of in-memory data larger than a 64-bit space would be extremely inefficient. Instead, this data will be internally structured/paged, and only virtually mapped to some 128-bit global reference (such as GUIDs/UUIDs), only to select smaller chunks within the structure. In most cases those chunks will remain in a 32-bit space: even in today's 64-bit OSes, the largest pages are 20-bit wide, but typically 9-bit wide (512-byte sectors) to 12-bit wide (standard VMM and I/O page sizes, networking MTUs), or about 16-bit wide (such as the transmission window for TCP).

This will not evolve significantly before a major evolution in the worldwide Internet backbones requiring more than about 1 Gigabit/s (a speed not even needed for 4K HD video, but needed only in massive computing grids, still built with a complex mesh of much slower data links). With 64 bits we already reach the physical limits of networking links, and higher speeds using large buses are only for extremely local links whose lengths are well below a few millimeters, within the chips themselves.

128 bits is possible, however, though not for working spaces (or document sizes): it is very unlikely that the ANSI C/C++ "size_t" type will ever be wider than 64 bits (except in a few experiments, which will fail to be more efficient).
What is more realistic is that internal buses and caches will be 128 bits wide or even larger (this is already true for GPU memory), only to support more parallelism or massive parallelism (typically by using vectored instructions working on sets of smaller values). And some data need 128-bit values for their numerical ranges (ALUs in CPU/GPU/APU are already 128-bit, as are common floating-point types) where extra precision is necessary.

I doubt we'll ever see any true native 128-bit architecture at any time in our remaining lifetimes. We are still very far from the limit of the 64-bit architecture, and it won't happen before the next century (if the current sequential binary model for computing is still used at that time; maybe computing will use predictive technologies returning only heuristic results with a very high probability of giving a good solution to the problems we'll need to solve extremely rapidly, and those solutions will then be validated using today's binary logic with 64-bit computing).

Even if a global 128-bit networking space were to appear, users will never be exposed to all of it. Most of this content will be inaccessible to them (restricted by security or privacy concerns) and simply unmanageable by them: no one on earth is able to have any idea of what 2^64 bits of global data represents, and no one will ever need it in their whole life. That amount of data will only be partly implemented by large organisations trying to build a giant cloud and wishing to interoperate by coordinating their addressing spaces (for that we now have IPv6). So your "sooner or later" is very optimistic.
IMHO we'll stay with 64-bit architectures for a very long time, up to the point where our sequential computing model is deprecated and the concept of native integer sizes is made obsolete and replaced by other kinds of computing "units" (notably parallel vectors, distributed computing, and heuristic computing, or maybe optical computing based on Fourier transforms on analog signals, or quantum computing, where our simple notion of "integers" or even "bits" will not even be placeable into individual physically placed units; their persistence will not even be localized, and there will be redundant/fault-tolerant placements).

In fact our computing limits will no longer be in terms of storage space, but in terms of access time, distance and predictability of results. The next technologies for faster computing will certainly be predictive/probabilistic rather than affirmative (as with today's Turing/Von Neumann machines). "Algorithms" for working with them will be completely different. Fuzzy logic will be everywhere and we'll need binary logic even less, except for small problems. We'll have to live with the possibility of errors, but we already have to live with them anyway, even with our binary logic (due to human bugs, hardware faults, accidents, and so on...). In most problems we don't even need to have 100% proven solutions (e.g. viewing a high-quality video, we already accept the possibility of some "quirks" occ
Re: Question about Perl5 extended UTF-8 design
On Thu, Nov 05, 2015 at 08:57:16AM -0700, Karl Williamson wrote:
> Several of us are wondering about the reason for reserving bits for
> the extended UTF-8 in perl5. I'm asking you because you are the
> apparent author of the commits that did this.

To start, the INTERNAL REPRESENTATION of Perl’s strings is the «utf8» format (not «UTF-8», «extended» or not). [I see that this misprint caused a lot of stir here!] However, outside of a few contexts, this internal representation should not be visible. (However, some of these contexts are close to the default, like read/write in Unicode mode, with the -C switch.)

A Perl string is just a sequence of Perl’s unsigned integers. [Depending on the build, these may currently be 32-bit or 64-bit.] By convention, the “meaning” of small integers coincides with what Unicode says.

> To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes
> the length of the sequence of bytes that comprise a single character
> to be 13 bytes. This allows code points up to 2**72 - 1 to be
> represented. If the length had been instead 12 bytes, code points up
> to 2**66 - 1 could be represented, which is enough to represent any
> code point possible in a 64-bit word.
>
> The comments indicate that these extra bits are "reserved". So
> we're wondering what potential use you had thought of for these
> bits.

First of all, “reserved” means that they have no meaning. Right?

Second, there are 2 ways in which one may need this INTERNAL format to be extended:

  • 128-bit architectures may be at hand (sooner or later).
  • One may need to allow “objects” to be embedded into Perl strings.

With embedded objects, one must know how to kill them when the string (or a part of it) is removed. So, while a pointer can fit into a Perl integer, one needs to specify what to do: call DESTROY, or free(), or a user-defined function.

This gives 5 possibilities (3 extra bits) which may be needed with “slots” in Perl strings:

  • Integer (≤64 bits)
  • Integer (≥65 bits)
  • Pointer to a Perl object
  • Pointer to malloc()ed memory
  • Pointer to a struct which knows how to destroy itself:

      struct self_destroy {
        void *content;
        void (*destroy)(struct self_destroy *);
      };

Why might one need objects embedded in strings? I explained it in http://ilyaz.org/interview (look for «Emacs» near the middle).

Hope this helps,
Ilya
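The five kinds of slot listed above can be sketched as a tagged value in C. This is only an illustration of the idea, not anything taken from the Perl sources: the names slot_kind, tag_bits, slot_release and free_content are hypothetical, and the Perl-object case is left as a no-op because calling DESTROY would need a running interpreter.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tags for the five slot kinds; five values need three
 * bits, because two bits can distinguish only four. */
enum slot_kind {
    SLOT_INT_SMALL,     /* integer of at most 64 bits */
    SLOT_INT_LARGE,     /* integer of 65 bits or more */
    SLOT_PERL_OBJECT,   /* pointer to a Perl object */
    SLOT_MALLOCED,      /* pointer to malloc()ed memory */
    SLOT_SELF_DESTROY,  /* pointer to a struct self_destroy */
    SLOT_KIND_COUNT
};

/* The struct from the message, with destroy as a function pointer. */
struct self_destroy {
    void *content;
    void (*destroy)(struct self_destroy *);
};

/* Number of bits needed to distinguish n different tags. */
unsigned tag_bits(unsigned n)
{
    unsigned bits = 0;
    while ((1u << bits) < n)
        bits++;
    return bits;
}

/* Example destructor (an assumption): frees the content and clears it. */
void free_content(struct self_destroy *sd)
{
    free(sd->content);
    sd->content = NULL;
}

/* Release whatever a slot owns when its string is removed.
 * Returns 1 if something was released here, 0 otherwise.
 * Integers own nothing; Perl objects are not handled in this sketch. */
int slot_release(enum slot_kind kind, void *payload)
{
    switch (kind) {
    case SLOT_MALLOCED:
        free(payload);
        return 1;
    case SLOT_SELF_DESTROY: {
        struct self_destroy *sd = payload;
        sd->destroy(sd);
        return 1;
    }
    default:
        return 0;
    }
}
```

With this layout, tag_bits(SLOT_KIND_COUNT) gives the 3 bits mentioned in the message, which is the count of extra bits a slot would need on top of a 64-bit payload.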
Re: Question about Perl5 extended UTF-8 design
Richard Wordingham wrote:
> No-one's claiming it is for a Unicode Transformation Format (UTF).

Then they ought not to call it "UTF-8" or "extended" or "modified" UTF-8, or anything of the sort, even if the bit-shifting algorithm is based on UTF-8.

"UTF-8 encoding form" is defined as a mapping of Unicode scalar values -- not arbitrary integers -- onto byte sequences. [D92]

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
Re: Question about Perl5 extended UTF-8 design
On Thu, 5 Nov 2015 18:25:05 +0100 Philippe Verdy wrote:
> But these extra code points could be used to represent something else
> such as unique object identifier for internal use in your
> application, or virtual object pointers, or shared memory block
> handles, file/pipe/stream I/O handles, service/API handles, user ids,
> security tokens, 64-bit content hashes plus some binary flags,
> placeholders/references for members in an external unencoded
> collection or for URIs, or internal glyph ids when converting text
> for rendering with one or more fonts, or some internal serialization
> of geometric shapes/colors/styles/visual effects...

No-one's claiming it is for a Unicode Transformation Format (UTF).

A possibly relevant example of something else is a non-precomposed grapheme cluster, as in Perl6's NFG. (This isn't a PUA encoding, as the precomposed characters are created on the fly.)

Richard.
Re: Question about Perl5 extended UTF-8 design
On Thu, Nov 5, 2015 at 9:25 AM, Philippe Verdy wrote:
> (0xFF was reserved only in the old RFC version of UTF-8 when it allowed
> code points up to 31 bits, but even this RFC is obsolete and should no
> longer be used and it has never been approved by Unicode).

No, even in the original UTF-8 definition, "The octet values FE and FF never appear."
https://tools.ietf.org/html/rfc2279
The highest lead byte was 0xFD.
(For the "really original" version see http://www.unicode.org/L2/Historical/wg20-n193-fss-utf.pdf)

In the current definition, "The octet values C0, C1, F5 to FF never appear."
https://tools.ietf.org/html/rfc3629 = https://tools.ietf.org/html/std63

markus
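The two sentences quoted from the RFCs translate directly into byte tests. The following is a minimal sketch in C; the function names are made up for illustration, and each function encodes only the quoted sentence, not the full well-formedness rules of either RFC.

```c
#include <assert.h>
#include <stdbool.h>

/* True if the octet can never occur under the sentence quoted from
 * RFC 3629: "The octet values C0, C1, F5 to FF never appear." */
bool rfc3629_octet_never_appears(unsigned char b)
{
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}

/* Same question under the sentence quoted from RFC 2279:
 * "The octet values FE and FF never appear." */
bool rfc2279_octet_never_appears(unsigned char b)
{
    return b == 0xFE || b == 0xFF;
}
```

Under both definitions 0xFF is excluded, so a decoder that follows either RFC can reject a sequence as soon as it sees that start byte.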
Re: Question about Perl5 extended UTF-8 design
It won't represent any valid Unicode code point (no standard scalar value is defined), so if you use those leading bytes, don't pretend it is for "UTF-8" (not even "modified UTF-8", which is the variant created in Java for its internal serialization of unrestricted 16-bit strings, including lone surrogates, and modified also in its representation of U+0000 as <0xC0,0x80> instead of <0x00> in standard UTF-8).

You'll have to create your own charset identifier (e.g. "perl5-UTF-8-extended" or some name derived from your Perl5 library) and say it is not for use in interchange of standard text.

The extra code points you'll get are then necessarily for private use (but still not part of the standard PUA set), and have absolutely no defined properties from the standard. They should not be used to represent any Unicode character or character sequence. In any API taking some text input, those code points will never be decoded and will behave on input like encoding errors.

But these extra code points could be used to represent something else such as unique object identifier for internal use in your application, or virtual object pointers, or shared memory block handles, file/pipe/stream I/O handles, service/API handles, user ids, security tokens, 64-bit content hashes plus some binary flags, placeholders/references for members in an external unencoded collection or for URIs, or internal glyph ids when converting text for rendering with one or more fonts, or some internal serialization of geometric shapes/colors/styles/visual effects...

In the standard UTF-8 those extra byte values are not "reserved" but permanently assigned to be "invalid", and there are no valid encoded sequences as long as 12 or 13 bytes (0xFF was reserved only in the old RFC version of UTF-8 when it allowed code points up to 31 bits, but even this RFC is obsolete and should no longer be used and it has never been approved by Unicode).
2015-11-05 16:57 GMT+01:00 Karl Williamson :
> Hi,
>
> Several of us are wondering about the reason for reserving bits for the
> extended UTF-8 in perl5. I'm asking you because you are the apparent
> author of the commits that did this.
>
> To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes the
> length of the sequence of bytes that comprise a single character to be 13
> bytes. This allows code points up to 2**72 - 1 to be represented. If the
> length had been instead 12 bytes, code points up to 2**66 - 1 could be
> represented, which is enough to represent any code point possible in a
> 64-bit word.
>
> The comments indicate that these extra bits are "reserved". So we're
> wondering what potential use you had thought of for these bits.
>
> Thanks
>
> Karl Williamson
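The remark above about Java's "modified UTF-8" can be made concrete with a small sketch in C. It encodes one 16-bit code unit at a time, so a lone surrogate simply becomes three bytes, and U+0000 takes the two-byte form <0xC0,0x80>. The function name is hypothetical and this is an illustration of the byte layout only, not Java's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one 16-bit code unit in the style of Java's modified UTF-8.
 * Writes 1 to 3 bytes into out and returns how many were written. */
size_t modified_utf8_encode_unit(uint16_t unit, unsigned char out[3])
{
    if (unit >= 0x0001 && unit <= 0x007F) {
        out[0] = (unsigned char)unit;               /* plain ASCII byte */
        return 1;
    }
    if (unit <= 0x07FF) {
        /* Two-byte form; this branch also takes U+0000, giving C0 80. */
        out[0] = (unsigned char)(0xC0 | (unit >> 6));
        out[1] = (unsigned char)(0x80 | (unit & 0x3F));
        return 2;
    }
    /* Three-byte form, used for lone surrogates as well. */
    out[0] = (unsigned char)(0xE0 | (unit >> 12));
    out[1] = (unsigned char)(0x80 | ((unit >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (unit & 0x3F));
    return 3;
}
```

Both outputs that make this variant "modified", the C0 80 pair and the encoded surrogates, are ill-formed in standard UTF-8, which is why such data should not be labelled as UTF-8 in interchange.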
Question about Perl5 extended UTF-8 design
Hi,

Several of us are wondering about the reason for reserving bits for the extended UTF-8 in perl5. I'm asking you because you are the apparent author of the commits that did this.

To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes the length of the sequence of bytes that comprise a single character to be 13 bytes. This allows code points up to 2**72 - 1 to be represented. If the length had been instead 12 bytes, code points up to 2**66 - 1 could be represented, which is enough to represent any code point possible in a 64-bit word.

The comments indicate that these extra bits are "reserved". So we're wondering what potential use you had thought of for these bits.

Thanks

Karl Williamson
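The arithmetic behind the 2**72 and 2**66 figures can be checked with a short sketch in C. It assumes, consistently with those figures, that the 0xFF start byte carries no payload bits and that every following byte is a continuation byte carrying 6 bits; the function names are hypothetical.

```c
#include <assert.h>

/* Payload bits carried by one continuation byte (10xxxxxx). */
enum { BITS_PER_CONTINUATION = 6 };

/* Payload bits of a sequence whose start byte is 0xFF.
 * total_len is the full length in bytes, start byte included. */
unsigned ff_payload_bits(unsigned total_len)
{
    assert(total_len >= 1);
    return (total_len - 1) * BITS_PER_CONTINUATION;
}

/* Smallest total length (start byte included) whose payload can hold
 * a value of value_bits bits. */
unsigned ff_min_length(unsigned value_bits)
{
    unsigned continuation =
        (value_bits + BITS_PER_CONTINUATION - 1) / BITS_PER_CONTINUATION;
    return continuation + 1;
}
```

A 13-byte sequence therefore carries 12 * 6 = 72 bits and a 12-byte one carries 11 * 6 = 66 bits, so 12 bytes already cover a 64-bit word and the 13-byte form leaves 8 bits beyond it.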