[jira] [Comment Edited] (LANG-1300) Clarify or improve behaviour of int-based methods in StringUtils

2017-03-10 Thread Rob Tompkins (JIRA)

[ 
https://issues.apache.org/jira/browse/LANG-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905219#comment-15905219
 ] 

Rob Tompkins edited comment on LANG-1300 at 3/10/17 3:11 PM:
-

Github user chtompki commented on the issue:

https://github.com/apache/commons-lang/pull/251
  
@dmjones500 - you think we should bubble this up to the dev list?

Personally I'm ok with either approach, code unit v. code point, just so long 
as we are consistent across all of the CharSequence implementations.



was (Author: githubbot):
Github user chtompki commented on the issue:

https://github.com/apache/commons-lang/pull/251
  
@dmjones500 - you think we should bubble this up to the dev list?


> Clarify or improve behaviour of int-based methods in StringUtils
> 
>
> Key: LANG-1300
> URL: https://issues.apache.org/jira/browse/LANG-1300
> Project: Commons Lang
>  Issue Type: Improvement
>  Components: lang.*
>Affects Versions: 3.5
>Reporter: Duncan Jones
>Priority: Minor
> Fix For: Discussion
>
>
> The following methods use an {{int}} to represent a search character:
> {code:java}
> boolean contains(final CharSequence seq, final int searchChar)
> int indexOf(final CharSequence seq, final int searchChar)
> int indexOf(final CharSequence seq, final int searchChar, final int startPos)
> int lastIndexOf(final CharSequence seq, final int searchChar)
> int lastIndexOf(final CharSequence seq, final int searchChar, final int 
> startPos)
> {code}
> When I see an {{int}} representing a character, I tend to assume the method 
> can handle supplementary characters. However, the current behaviour of these 
> methods depends upon whether the {{CharSequence}} is a {{String}} or not.
> {code:java}
> StringBuilder builder = new StringBuilder();
> builder.appendCodePoint(0x2070E);
> System.out.println(StringUtils.lastIndexOf(builder, 0x2070E)); // -1
> System.out.println(StringUtils.lastIndexOf(builder.toString(), 0x2070E)); // 0
> {code}
> The Javadoc for these methods are ambiguous on this point, stating:
> {quote}
> This method uses {{String.lastIndexOf(int)}} if possible.
> {quote}
> I think we should consider updating the {{CharSequenceUtils}} methods used by 
> this class to convert all {{CharSequence}} parameters to strings, enabling 
> full code point support. The docs could be updated to make this crystal clear.
> There is a question of whether this breaks backwards compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (LANG-1300) Clarify or improve behaviour of int-based methods in StringUtils

2017-03-07 Thread Rob Tompkins (JIRA)

[ 
https://issues.apache.org/jira/browse/LANG-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899421#comment-15899421
 ] 

Rob Tompkins edited comment on LANG-1300 at 3/7/17 1:20 PM:


Github user chtompki commented on the issue:

https://github.com/apache/commons-lang/pull/251
  
@dmjones500 - no worries on the being busy, we all end up there for time to 
time... :-) 

@dmjones500 has an interesting point. The problem seems to lie with the 
number of "Supplementary Code Points" preceding the *findable* `searchChar` 
that have been previously split into their complementary surrogate pairs.  

You may need to consider using `Character.isSurrogate(char ch)` as well as 
`Character.isSurrogatePair(char high, char low)` for all characters preceding 
our *findable* code point. Granted, that adds an *O( n )* multiplier on our 
method's efficiency pushing us to *O(n2)*. It feels like only then 
can we be absolutely certain that we are not over counting using *code units* 
as opposed to *code points*. 

If indeed we do move this direction, we should be quite clear, in the 
javadoc, that there is a notable performance reduction when operating outside 
the "Basic Multilingual Plane" (ref. [Oracle's Character 
documentation](https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#supplementary)).

@PascalSchumacher - you have any thoughts here as well?



was (Author: githubbot):
Github user chtompki commented on the issue:

https://github.com/apache/commons-lang/pull/251
  
@dmjones500 - no worries on the being busy, we all end up there for time to 
time... :-) 

@dmjones500 has an interesting point. The problem seems to lie with the 
number of "Supplementary Code Points" preceding the *findable* `searchChar` 
that have been previously split into their complementary surrogate pairs.  

You may need to consider using `Character.isSurrogate(char ch)` as well as 
`Character.isSurrogatePair(char high, char low)` for all characters preceding 
our *findable* code point. Granted, that adds an *O(n)* multiplier on our 
method's efficiency pushing us to *O(n2)*. It feels like only then 
can we be absolutely certain that we are not over counting using *code units* 
as opposed to *code points*. 

If indeed we do move this direction, we should be quite clear, in the 
javadoc, that there is a notable performance reduction when operating outside 
the "Basic Multilingual Plane" (ref. [Oracle's Character 
documentation](https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#supplementary)).

@PascalSchumacher - you have any thoughts here as well?


> Clarify or improve behaviour of int-based methods in StringUtils
> 
>
> Key: LANG-1300
> URL: https://issues.apache.org/jira/browse/LANG-1300
> Project: Commons Lang
>  Issue Type: Improvement
>  Components: lang.*
>Affects Versions: 3.5
>Reporter: Duncan Jones
>Priority: Minor
> Fix For: Discussion
>
>
> The following methods use an {{int}} to represent a search character:
> {code:java}
> boolean contains(final CharSequence seq, final int searchChar)
> int indexOf(final CharSequence seq, final int searchChar)
> int indexOf(final CharSequence seq, final int searchChar, final int startPos)
> int lastIndexOf(final CharSequence seq, final int searchChar)
> int lastIndexOf(final CharSequence seq, final int searchChar, final int 
> startPos)
> {code}
> When I see an {{int}} representing a character, I tend to assume the method 
> can handle supplementary characters. However, the current behaviour of these 
> methods depends upon whether the {{CharSequence}} is a {{String}} or not.
> {code:java}
> StringBuilder builder = new StringBuilder();
> builder.appendCodePoint(0x2070E);
> System.out.println(StringUtils.lastIndexOf(builder, 0x2070E)); // -1
> System.out.println(StringUtils.lastIndexOf(builder.toString(), 0x2070E)); // 0
> {code}
> The Javadoc for these methods are ambiguous on this point, stating:
> {quote}
> This method uses {{String.lastIndexOf(int)}} if possible.
> {quote}
> I think we should consider updating the {{CharSequenceUtils}} methods used by 
> this class to convert all {{CharSequence}} parameters to strings, enabling 
> full code point support. The docs could be updated to make this crystal clear.
> There is a question of whether this breaks backwards compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)