It won't represent any valid Unicode code point (no standard scalar value is defined), so if you use those leading bytes, don't pretend it is "UTF-8" (not even "modified UTF-8", the variant created in Java for its internal serialization of unrestricted 16-bit strings, which permits lone surrogates and also represents U+0000 as <0xC0,0x80> instead of <0x00> as in standard UTF-8). You'll have to create your own charset identifier (e.g. "perl5-UTF-8-extended" or some name derived from your Perl5 library) and state that it is not for use in the interchange of standard text.
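To illustrate the point, here is a minimal Python sketch showing how a strict, standard UTF-8 decoder treats such byte sequences. The helper name decodes_as_utf8 is made up for this example, and the 13-byte sequence is only a stand-in for an extended sequence starting with 0xFF, not the exact bytes Perl would produce.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed standard UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # Python's default error mode is strict
        return True
    except UnicodeDecodeError:
        return False


# Stand-in for an extended sequence: start byte 0xFF plus 12 continuation bytes.
extended_sequence = b"\xff" + b"\x80" * 12

# Java "modified UTF-8" encoding of U+0000 (an overlong form in standard UTF-8).
modified_utf8_nul = b"\xc0\x80"

# A lone surrogate (U+D800) serialized with the 3-byte pattern.
lone_surrogate = b"\xed\xa0\x80"

for label, data in [
    ("extended 0xFF sequence", extended_sequence),
    ("modified UTF-8 NUL", modified_utf8_nul),
    ("lone surrogate", lone_surrogate),
    ("standard NUL", b"\x00"),
]:
    print(label, "->", "valid" if decodes_as_utf8(data) else "invalid")
```

Only the last input is accepted; the other three behave like encoding errors, which is what any conforming consumer of interchanged text will do with them.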
The extra code points you'll get are then necessarily for private use (but still not part of the standard PUA set), and have absolutely no properties defined by the standard. They should not be used to represent any Unicode character or character sequence. In any API taking text input, those code points will never be decoded and will behave on input like encoding errors. But these extra code points could be used to represent something else for internal use in your application: unique object identifiers, virtual object pointers, shared memory block handles, file/pipe/stream I/O handles, service/API handles, user ids, security tokens, 64-bit content hashes plus some binary flags, placeholders/references for members in an external unencoded collection or for URIs, internal glyph ids when converting text for rendering with one or more fonts, or some internal serialization of geometric shapes/colors/styles/visual effects...

In standard UTF-8 those extra byte values are not "reserved" but permanently assigned to be "invalid", and there are no valid encoded sequences as long as 12 or 13 bytes (0xFF was reserved only in the old RFC version of UTF-8, when it allowed code points up to 31 bits, but even this RFC is obsolete and should no longer be used, and it has never been approved by Unicode).

2015-11-05 16:57 GMT+01:00 Karl Williamson <[email protected]>:

> Hi,
>
> Several of us are wondering about the reason for reserving bits for the
> extended UTF-8 in perl5. I'm asking you because you are the apparent
> author of the commits that did this.
>
> To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes the
> length of the sequence of bytes that comprise a single character to be 13
> bytes. This allows code points up to 2**72 - 1 to be represented. If the
> length had been instead 12 bytes, code points up to 2**66 - 1 could be
> represented, which is enough to represent any code point possible in a
> 64-bit word.
>
> The comments indicate that these extra bits are "reserved". So we're
> wondering what potential use you had thought of for these bits.
>
> Thanks
>
> Karl Williamson
>
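
The arithmetic in the quoted message can be checked with a short Python sketch. It assumes, as the figures in the message imply, that the 0xFF start byte carries no payload bits and that each continuation byte carries 6; the function names are invented for this example and are not taken from the Perl sources.

```python
def payload_bits(sequence_length: int) -> int:
    """Payload bits of a sequence whose start byte (0xFF) carries none.

    Assumption: every byte after the start byte is a continuation byte
    of the form 10xxxxxx, contributing 6 payload bits.
    """
    return 6 * (sequence_length - 1)


def max_code_point(sequence_length: int) -> int:
    """Largest value representable with that many payload bits."""
    return 2 ** payload_bits(sequence_length) - 1


MAX_64_BIT = 2**64 - 1

for length in (11, 12, 13):
    bits = payload_bits(length)
    fits = max_code_point(length) >= MAX_64_BIT
    print(f"{length} bytes: {bits} payload bits, covers a 64-bit word: {fits}")
```

Under that assumption, 11 bytes give only 60 bits, 12 bytes give 66 bits (already enough for a 64-bit word), and 13 bytes give 72 bits, so the 13-byte form leaves 8 bits beyond what a 64-bit value needs.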

