[
https://issues.apache.org/jira/browse/LANG-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899421#comment-15899421
]
Rob Tompkins edited comment on LANG-1300 at 3/7/17 1:20 PM:
Github user chtompki commented on the issue:
https://github.com/apache/commons-lang/pull/251
@dmjones500 - no worries on the being busy, we all end up there for time to
time... :-)
@dmjones500 has an interesting point. The problem seems to lie with the
number of "Supplementary Code Points" preceding the *findable* `searchChar`
that have been previously split into their complementary surrogate pairs.
You may need to consider using `Character.isSurrogate(char ch)` as well as
`Character.isSurrogatePair(char high, char low)` for all characters preceding
our *findable* code point. Granted, that adds an *O( n )* multiplier on our
method's efficiency pushing us to *O(n2)*. It feels like only then
can we be absolutely certain that we are not over counting using *code units*
as opposed to *code points*.
If indeed we do move this direction, we should be quite clear, in the
javadoc, that there is a notable performance reduction when operating outside
the "Basic Multilingual Plane" (ref. [Oracle's Character
documentation](https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#supplementary)).
@PascalSchumacher - you have any thoughts here as well?
was (Author: githubbot):
Github user chtompki commented on the issue:
https://github.com/apache/commons-lang/pull/251
@dmjones500 - no worries on the being busy, we all end up there for time to
time... :-)
@dmjones500 has an interesting point. The problem seems to lie with the
number of "Supplementary Code Points" preceding the *findable* `searchChar`
that have been previously split into their complementary surrogate pairs.
You may need to consider using `Character.isSurrogate(char ch)` as well as
`Character.isSurrogatePair(char high, char low)` for all characters preceding
our *findable* code point. Granted, that adds an *O(n)* multiplier on our
method's efficiency pushing us to *O(n2)*. It feels like only then
can we be absolutely certain that we are not over counting using *code units*
as opposed to *code points*.
If indeed we do move this direction, we should be quite clear, in the
javadoc, that there is a notable performance reduction when operating outside
the "Basic Multilingual Plane" (ref. [Oracle's Character
documentation](https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#supplementary)).
@PascalSchumacher - you have any thoughts here as well?
> Clarify or improve behaviour of int-based methods in StringUtils
>
>
> Key: LANG-1300
> URL: https://issues.apache.org/jira/browse/LANG-1300
> Project: Commons Lang
> Issue Type: Improvement
> Components: lang.*
>Affects Versions: 3.5
>Reporter: Duncan Jones
>Priority: Minor
> Fix For: Discussion
>
>
> The following methods use an {{int}} to represent a search character:
> {code:java}
> boolean contains(final CharSequence seq, final int searchChar)
> int indexOf(final CharSequence seq, final int searchChar)
> int indexOf(final CharSequence seq, final int searchChar, final int startPos)
> int lastIndexOf(final CharSequence seq, final int searchChar)
> int lastIndexOf(final CharSequence seq, final int searchChar, final int
> startPos)
> {code}
> When I see an {{int}} representing a character, I tend to assume the method
> can handle supplementary characters. However, the current behaviour of these
> methods depends upon whether the {{CharSequence}} is a {{String}} or not.
> {code:java}
> StringBuilder builder = new StringBuilder();
> builder.appendCodePoint(0x2070E);
> System.out.println(StringUtils.lastIndexOf(builder, 0x2070E)); // -1
> System.out.println(StringUtils.lastIndexOf(builder.toString(), 0x2070E)); // 0
> {code}
> The Javadoc for these methods are ambiguous on this point, stating:
> {quote}
> This method uses {{String.lastIndexOf(int)}} if possible.
> {quote}
> I think we should consider updating the {{CharSequenceUtils}} methods used by
> this class to convert all {{CharSequence}} parameters to strings, enabling
> full code point support. The docs could be updated to make this crystal clear.
> There is a question of whether this breaks backwards compatibility.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)