Zhongxin Yan created LANG-1802:
----------------------------------
Summary: [Improvement] Optimize CharRange.hashCode() to reduce
collision rate and improve calculation efficiency
Key: LANG-1802
URL: https://issues.apache.org/jira/browse/LANG-1802
Project: Commons Lang
Issue Type: Improvement
Components: lang.*
Affects Versions: 3.20.0
Reporter: Zhongxin Yan
*1. Background*
The CharRange class (package: org.apache.commons.lang3) is a core component of
the CharSet utility, and its hashCode() method is frequently invoked in
scenarios such as HashMap/HashSet storage. The current implementation of
hashCode() (linear combination) has severe hash collision problems and
suboptimal calculation efficiency, which affects the performance of upper-layer
applications.
{code:java}
//Current hashCode implementation
@Override
public int hashCode() {
return 83 + start + 7 * end + (negated ? 1 : 0);
}{code}
h5. *2. Problem Description*
The current {{hashCode()}} implementation has two critical issues:
* {*}High collision rate{*}: The linear combination ({{{}83 + start + 7 * end
+ flag{}}}) easily causes hash value offset. For example:
** {{CharRange.isNotIn((char) 1, (char) 2)}} (start=1, end=2, negated=true) →
hash = 83+1+14+1=99
** {{CharRange.isIn((char) 2, (char) 2)}} (start=2, end=2, negated=false) →
hash = 83+2+14=99
These two logically distinct instances (equals() returns false) have the same
hash code, leading to serious bucket conflicts in {{{}HashMap{}}}.
* {*}Low calculation efficiency{*}: Relies on multiple arithmetic operations
(1 multiplication + 3 additions), which are slower than bitwise operations and
fail to utilize the 32-bit space of {{int}} efficiently.
{code:java}
// hash collision cases
@Test
void testHashCodeCollision() {
// case A:hash=99
CharRange a1 = CharRange.isNotIn((char)1, (char)2); // 1,2,true →
83+1+14+1=99
CharRange a2 = CharRange.isIn((char)2, (char)2); // 2,2,false →
83+2+14+0=99
assertEquals(a1.hashCode(), a2.hashCode()); // Collision
// case B:hash=123
CharRange b1 = CharRange.isIn((char)5, (char)5); //5,5,false
→83+5+35+0=123
CharRange b2 = CharRange.isNotIn((char)4, (char)5); //4,5,true
→83+4+35+1=123
assertEquals(b1.hashCode(),b2.hashCode()); //Collision
} {code}
*3. Proposed Solution*
Replace the current linear combination implementation with a bitwise splicing +
XOR flag scheme, which maximizes the use of 32-bit int space, minimizes
collision rate, and optimizes calculation efficiency:
[github PR|[https://github.com/apache/commons-lang/pull/1524]
|https://github.com/apache/commons-lang/pull/1524]] [~ggregory]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)