[ https://issues.apache.org/jira/browse/LANG-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188393#comment-17188393 ]
Sebb commented on LANG-1606:
----------------------------

There are two ways to count matches: overlapping or non-overlapping.
The code currently correctly counts non-overlapping matches.

I agree that the Javadoc is not clear on this, however I don't think it is actually wrong.

Given that users may rely on the current behaviour, generally it is the Javadoc that must be changed rather than the code.

> StringUtils.countMatches returns incorrect value while handling intersecting substrings
> ---------------------------------------------------------------------------------------
>
>                 Key: LANG-1606
>                 URL: https://issues.apache.org/jira/browse/LANG-1606
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.*
>    Affects Versions: 3.11
>            Reporter: Rustem Galiev
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Steps to reproduce:
> 1. Call the method like that:
> {code:java}
> int count = StringUtils.countMatches("abaabaababaab", "aba");
> {code}
> Actual result: the value of count variable equals 3
> Expected result: the value of count variable equals 4
> The substrings are highlighted in red:
> {color:#ff0000}aba{color}abaababaab
> aba{color:#ff0000}aba{color}ababaab
> abaaba{color:#ff0000}aba{color}baab
> abaabaab{color:#ff0000}aba{color}ab
> Method returns incorrect value because of this code:
> {code:java}
> while ((idx = CharSequenceUtils.indexOf(str, sub, idx)) != INDEX_NOT_FOUND) {
>     count++;
>     idx += sub.length();
> }
> {code}
> This looks like a greedy algorithm - but increasing the idx variable by the
> length of substring could lead to the problems like in example:
> Let's say that idx = 6, so we try to find a substring in the highlighted
> suffix:
> abaaba{color:#ff0000}ababaab{color}
> We found the substring, so idx now becomes idx + 3 = 9. So now this suffix
> will be used for searching substring in it:
> abaabaaba{color:#ff0000}baab{color}
> But because of increasing the value of idx by 3 we won't find the substring
> (abaabaab{color:#ff0000}aba{color}ab) which intersects with the already found
> substring on the last step.
> Basically, this method will work incorrectly with any substrings that
> intersect with each other.
> There is also a unit test with incorrect expected value:
> {code:java}
> assertEquals(4,
>     StringUtils.countMatches("oooooooooooo", "ooo"));
> {code}
> If this behavior (counting substrings that do not intersect) is intended,
> please update the JavaDoc to reflect it. Right now it looks like that:
> {code:java}
> Counts how many times the substring appears in the larger string.
> {code}
> Link for the PR: https://github.com/apache/commons-lang/pull/615

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
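
For reference, the two counting modes discussed in this thread can be compared side by side. The following is only an illustrative sketch, not the Commons Lang implementation: the class name MatchCounting and both method names are hypothetical, and plain String.indexOf stands in for CharSequenceUtils.indexOf. The only difference between the two loops is how far idx advances after a hit.

```java
// Hypothetical helper for illustration only; not part of Commons Lang.
final class MatchCounting {

    private MatchCounting() {
    }

    // Non-overlapping count: after a hit, skip past the whole match.
    // This mirrors the loop quoted in the issue description.
    static int countNonOverlapping(String str, String sub) {
        if (str == null || sub == null || str.isEmpty() || sub.isEmpty()) {
            return 0;
        }
        int count = 0;
        int idx = 0;
        while ((idx = str.indexOf(sub, idx)) != -1) {
            count++;
            idx += sub.length();
        }
        return count;
    }

    // Overlapping count: after a hit, advance by a single character so that
    // a match starting inside the previous one is still found.
    static int countOverlapping(String str, String sub) {
        if (str == null || sub == null || str.isEmpty() || sub.isEmpty()) {
            return 0;
        }
        int count = 0;
        int idx = 0;
        while ((idx = str.indexOf(sub, idx)) != -1) {
            count++;
            idx += 1;
        }
        return count;
    }

    public static void main(String[] args) {
        String str = "abaabaababaab";
        String sub = "aba";
        // Matches start at indices 0, 3, 6 and 8; the one at 8 overlaps the one at 6.
        System.out.println("non-overlapping: " + countNonOverlapping(str, sub)); // 3
        System.out.println("overlapping:     " + countOverlapping(str, sub));    // 4
    }
}
```

With the reporter's input, the non-overlapping variant returns 3 (the current behaviour) and the overlapping variant returns 4 (the value the reporter expected), so the disagreement is about which definition the Javadoc should describe.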