[
https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772151#action_12772151
]
Steven Rowe commented on LUCENE-2019:
-------------------------------------
bq. process-internal is something that won't be stored or interchanged in any
way (internal to the process)
Right, this is the crux of the disagreement: you think storage (with the
exception of in-memory usage) means interchange. Yonik and I think that
storage does not necessarily mean interchange.
Section 16.7 (_Noncharacters_) of the Unicode 5.0.0 standard (the latest
version for which an electronic version of this chapter is available), says:
{quote}
Noncharacters are code points that are permanently reserved in the Unicode
Standard for internal use. They are forbidden for use in open interchange of
Unicode text data. See Section 3.4, Characters and Encoding, for the formal
definition of noncharacters and conformance requirements related to their use.
The Unicode Standard sets aside 66 noncharacter code points. The last two code
points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE
and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for
a total of 34 code points. In addition, there is a contiguous range of another
32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons,
the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A
block, but those noncharacters are not "Arabic noncharacters" or "right-to-left
noncharacters," and are not distinguished in any other way from the other
noncharacters, except in their code point values.
Applications are free to use any of these noncharacter code points internally
but should never attempt to exchange them. If a noncharacter is received in
open interchange, an application is not required to interpret it in any way. It
is good practice, however, to recognize it as a noncharacter and to take
appropriate action, such as removing it from the text. Note that Unicode
conformance freely allows the removal of these characters. (See conformance
clause C7 in Section 3.2, Conformance Requirements.)
In effect, noncharacters can be thought of as application-internal private-use
code points. Unlike the private-use characters discussed in Section 16.5,
Private-Use Characters, which are assigned characters and which are intended
for use in open interchange, subject to interpretation by private agreement,
noncharacters are permanently reserved (unassigned) and have no interpretation
whatsoever outside of their possible application-internal private uses.
*U+FFFF and U+10FFFF.* These two noncharacter code points have the attribute
of being associated with the largest code unit values for particular Unicode
encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code
unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit
code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code
points useful for internal purposes as sentinels. For example, they might be
used to indicate the end of a list, to represent a value in an index guaranteed
to be higher than any valid character value, and so on.
{quote}
(I left out the last part about U+FFFE.)
Again, the crux of the matter is the definition of "open interchange".
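To make the quoted definition concrete, here is a minimal sketch of a check for the 66 noncharacter code points it enumerates. The class and method names are hypothetical, chosen for illustration only; this is not code from Lucene or from the attached patch.

```java
// Hypothetical helper, not part of Lucene: identifies the 66 noncharacters
// enumerated in Section 16.7 of the Unicode standard.
final class Noncharacters {
    private Noncharacters() {}

    /** Returns true if codePoint is one of the 66 Unicode noncharacters. */
    static boolean isNoncharacter(int codePoint) {
        if (codePoint < 0 || codePoint > Character.MAX_CODE_POINT) {
            return false; // outside the Unicode code space entirely
        }
        // Contiguous BMP range U+FDD0..U+FDEF (32 code points).
        if (codePoint >= 0xFDD0 && codePoint <= 0xFDEF) {
            return true;
        }
        // Last two code points of each of the 17 planes: U+nFFFE and U+nFFFF.
        return (codePoint & 0xFFFE) == 0xFFFE;
    }

    /** Counts noncharacters by brute force over the whole code space. */
    static int countNoncharacters() {
        int count = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isNoncharacter(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 34 plane-final code points plus 32 in U+FDD0..U+FDEF.
        System.out.println(countNoncharacters());
    }
}
```

The check says nothing about whether a given use counts as "open interchange"; it only identifies which code points the quoted section is talking about.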
> map unicode process-internal codepoints to replacement character
> ----------------------------------------------------------------
>
> Key: LUCENE-2019
> URL: https://issues.apache.org/jira/browse/LUCENE-2019
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Robert Muir
> Priority: Minor
> Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal code points in Unicode; we should not store
> these in the index.
> Instead they should be mapped to the replacement character (U+FFFD), so they can
> be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF
> process-internally: it can't be in the index or it will cause problems.
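For readers following the proposal in the issue description, the mapping it describes might look roughly like the sketch below. This is an assumption about the general shape of such a mapping, not the contents of LUCENE-2019.patch; the class and method names are hypothetical.

```java
// Hypothetical sketch, not taken from LUCENE-2019.patch: replaces every
// noncharacter code point in a string with U+FFFD (REPLACEMENT CHARACTER).
final class NoncharacterMapper {
    static final int REPLACEMENT_CHARACTER = 0xFFFD;

    private NoncharacterMapper() {}

    /** True for U+FDD0..U+FDEF and for the last two code points of each plane. */
    static boolean isNoncharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE;
    }

    /** Returns a copy of text with each noncharacter mapped to U+FFFD. */
    static String mapNoncharacters(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // Read whole code points so supplementary-plane noncharacters
            // such as U+1FFFF are handled as a single unit.
            int cp = text.codePointAt(i);
            out.appendCodePoint(isNoncharacter(cp) ? REPLACEMENT_CHARACTER : cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With a mapping like this applied before terms are written, U+FFFF would never appear in stored text and would stay available as a process-internal sentinel, which is the outcome the description asks for. Whether storage in the index counts as interchange is the question this comment thread disputes.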