[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Steven Rowe (JIRA) Fri, 30 Oct 2009 15:58:24 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772151#action_12772151 ]


Steven Rowe commented on LUCENE-2019:
-------------------------------------

bq. process-internal is something that won't be stored or interchanged in any 
way (internal to the process)

Right, this is the crux of the disagreement: you think storage (with the 
exception of in-memory usage) means interchange.  Yonik and I think that 
storage does not necessarily mean interchange.

Section 16.7 (_Noncharacters_) of the Unicode 5.0.0 standard (the latest 
version for which an electronic version of this chapter is available) says:

{quote}
Noncharacters are code points that are permanently reserved in the Unicode 
Standard for internal use. They are forbidden for use in open interchange of 
Unicode text data. See Section 3.4, Characters and Encoding, for the formal 
definition of noncharacters and conformance requirements related to their use.

The Unicode Standard sets aside 66 noncharacter code points. The last two code 
points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE 
and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for 
a total of 34 code points. In addition, there is a contiguous range of another 
32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons, 
the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A 
block, but those noncharacters are not "Arabic noncharacters" or "right-to-left 
noncharacters," and are not distinguished in any other way from the other 
noncharacters, except in their code point values.

Applications are free to use any of these noncharacter code points internally 
but should never attempt to exchange them. If a noncharacter is received in 
open interchange, an application is not required to interpret it in any way. It 
is good practice, however, to recognize it as a noncharacter and to take 
appropriate action, such as removing it from the text. Note that Unicode 
conformance freely allows the removal of these characters. (See conformance 
clause C7 in Section 3.2, Conformance Requirements.)

In effect, noncharacters can be thought of as application-internal private-use 
code points. Unlike the private-use characters discussed in Section 16.5, 
Private-Use Characters, which are assigned characters and which are intended 
for use in open interchange, subject to interpretation by private agreement, 
noncharacters are permanently reserved (unassigned) and have no interpretation 
whatsoever outside of their possible application-internal private uses.

*U+FFFF and U+10FFFF.*  These two noncharacter code points have the attribute 
of being associated with the largest code unit values for particular Unicode 
encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code 
unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit 
code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code 
points useful for internal purposes as sentinels. For example, they might be 
used to indicate the end of a list, to represent a value in an index guaranteed 
to be higher than any valid character value, and so on.
{quote}

(I left out the last part about U+FFFE.)
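
To make the quoted definition concrete, here is a minimal Java sketch of a noncharacter test. The class and method names are hypothetical and are not taken from Lucene or from the attached patch; the check only encodes the two rules quoted above (the range U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes).

```java
// Hypothetical helper for illustration only; not part of Lucene or LUCENE-2019.patch.
final class Noncharacters {

    // Returns true for the noncharacter code points described in Section 16.7.
    static boolean isNoncharacter(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            return false; // outside the Unicode code space
        }
        if (codePoint >= 0xFDD0 && codePoint <= 0xFDEF) {
            return true; // the contiguous range of 32 in the BMP
        }
        // Last two code points of every plane: low 16 bits are FFFE or FFFF.
        return (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Counts noncharacters by scanning the whole code space.
    static int countNoncharacters() {
        int count = 0;
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (isNoncharacter(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 34 plane-final code points plus the 32 in U+FDD0..U+FDEF.
        System.out.println(countNoncharacters());
    }
}
```

Scanning the full code space with this predicate should give the 66 code points the standard mentions (34 + 32).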

Again, the crux of the matter is the definition of "open interchange".
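
For reference, the mapping this issue proposes could look roughly like the sketch below. It is an assumption-laden illustration, not the contents of LUCENE-2019.patch: the class name, method names, and the choice to operate on a whole String are all hypothetical. It walks the text by code point and substitutes U+FFFD for each noncharacter.

```java
// Hypothetical sketch; the attached patch may do this differently.
final class NoncharacterMapper {

    static final int REPLACEMENT_CHARACTER = 0xFFFD;

    // True for U+FDD0..U+FDEF and for the last two code points of each plane.
    static boolean isNoncharacter(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            return false;
        }
        if (codePoint >= 0xFDD0 && codePoint <= 0xFDEF) {
            return true;
        }
        return (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of the input with every noncharacter replaced by U+FFFD.
    static String mapNoncharacters(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            out.appendCodePoint(isNoncharacter(cp) ? REPLACEMENT_CHARACTER : cp);
            i += Character.charCount(cp); // advance past a surrogate pair if present
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "a" + new String(Character.toChars(0xFFFF)) + "b";
        String mapped = mapNoncharacters(text);
        System.out.println(Integer.toHexString(mapped.codePointAt(1)));
    }
}
```

Whether such a mapping belongs at indexing time is exactly what depends on how "open interchange" is read; the sketch only shows that the mechanics are simple either way.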

> map unicode process-internal codepoints to replacement character
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2019
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2019
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in Unicode; we should not store 
> these in the index.
> Instead, they should be mapped to the replacement character (U+FFFD), so they 
> can be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF 
> process-internally: it can't be in the index or it will cause problems.

