[ 
https://issues.apache.org/jira/browse/LANG-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188393#comment-17188393
 ] 

Sebb commented on LANG-1606:
----------------------------

There are two ways to count matches: overlapping or non-overlapping.
The code currently correctly counts non-overlapping matches.

I agree that the Javadoc is not clear on this; however, I don't think it is 
actually wrong.

Given that users may rely on the current behaviour, generally it is the Javadoc 
that must be changed rather than the code.
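
To make the distinction concrete, the two ways of counting could be sketched 
roughly as below. This is only an illustration built on plain String.indexOf, 
not the Commons Lang implementation; the class name CountMatchesSketch and both 
method names are hypothetical and do not exist in the library. The only 
difference between the two loops is how far the search index advances after a 
match.

```java
// Hypothetical illustration only; not part of Commons Lang.
final class CountMatchesSketch {

    // Non-overlapping: after a match, resume searching just past its end.
    static int countNonOverlapping(String str, String sub) {
        if (str == null || sub == null || str.isEmpty() || sub.isEmpty()) {
            return 0; // assumption: null or empty input counts as zero matches
        }
        int count = 0;
        int idx = 0;
        while ((idx = str.indexOf(sub, idx)) != -1) {
            count++;
            idx += sub.length(); // skip the whole match
        }
        return count;
    }

    // Overlapping: after a match, resume searching one character later.
    static int countOverlapping(String str, String sub) {
        if (str == null || sub == null || str.isEmpty() || sub.isEmpty()) {
            return 0; // same assumption as above
        }
        int count = 0;
        int idx = 0;
        while ((idx = str.indexOf(sub, idx)) != -1) {
            count++;
            idx++; // step by one so that intersecting matches are found too
        }
        return count;
    }

    public static void main(String[] args) {
        String str = "abaabaababaab";
        System.out.println(countNonOverlapping(str, "aba")); // 3
        System.out.println(countOverlapping(str, "aba"));    // 4
    }
}
```

With the input from the report, the first method gives 3 (matching the current 
behaviour described in the issue) and the second gives 4 (the value the 
reporter expected).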

> StringUtils.countMatches returns incorrect value while handling intersecting 
> substrings
> ---------------------------------------------------------------------------------------
>
>                 Key: LANG-1606
>                 URL: https://issues.apache.org/jira/browse/LANG-1606
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.*
>    Affects Versions: 3.11
>            Reporter: Rustem Galiev
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Steps to reproduce:
> 1. Call the method like this:
> {code:java}
> int count = StringUtils.countMatches("abaabaababaab", "aba");
> {code}
> Actual result: the value of the count variable is 3
> Expected result: the value of the count variable is 4
> The substrings are highlighted in red:
>  {color:#ff0000}aba{color}abaababaab
>  aba{color:#ff0000}aba{color}ababaab
>  abaaba{color:#ff0000}aba{color}baab
>  abaabaab{color:#ff0000}aba{color}ab
> Method returns incorrect value because of this code:
> {code:java}
> while ((idx = CharSequenceUtils.indexOf(str, sub, idx)) != INDEX_NOT_FOUND) {
>     count++;
>     idx += sub.length();
> }
> {code}
> This looks like a greedy algorithm, but increasing the idx variable by the 
> length of the substring can skip matches, as in this example:
> Let's say that idx = 6, so we search for the substring in the highlighted 
> suffix:
>  abaaba{color:#ff0000}ababaab{color}
> The substring is found at index 6, so idx becomes idx + 3 = 9. The next 
> search therefore uses this suffix:
>  abaabaaba{color:#ff0000}baab{color}
> Because idx was increased by 3, we never find the substring 
> (abaabaab{color:#ff0000}aba{color}ab) that starts at index 8 and intersects 
> with the match found in the previous step.
> In general, this method does not count occurrences that intersect with an 
> occurrence it has already counted.
> There is also a unit test with an incorrect expected value:
> {code:java}
> assertEquals(4,
>      StringUtils.countMatches("oooooooooooo", "ooo"));
> {code}
> If this behavior (counting only substrings that do not intersect) is 
> intended, please update the Javadoc to reflect it. Right now it reads:
> {code:java}
> Counts how many times the substring appears in the larger string.
> {code}
> Link for the PR: https://github.com/apache/commons-lang/pull/615



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
