[
https://issues.apache.org/jira/browse/LANG-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhongxin Yan updated LANG-1802:
-------------------------------
Description:
*1. Background*
The CharRange class (package: org.apache.commons.lang3) is a core component of
the CharSet utility, and its hashCode() method is frequently invoked in
scenarios such as HashMap/HashSet storage. The current implementation of
hashCode() (linear combination) has severe hash collision problems and
suboptimal calculation efficiency, which affects the performance of upper-layer
applications.
{code:java}
//Current hashCode implementation
@Override
public int hashCode() {
return 83 + start + 7 * end + (negated ? 1 : 0);
}{code}
h5. *2. Problem Description*
The current {{hashCode()}} implementation has two critical issues:
* {*}High collision rate{*}: The linear combination ({{{}83 + start + 7 * end
+ flag{}}}) easily causes hash value offset. For example:
** {{CharRange.isNotIn((char) 1, (char) 2)}} (start=1, end=2, negated=true) →
hash = 83+1+14+1=99
** {{CharRange.isIn((char) 2, (char) 2)}} (start=2, end=2, negated=false) →
hash = 83+2+14=99
These two logically distinct instances (equals() returns false) have the same
hash code, leading to serious bucket conflicts in {{{}HashMap{}}}.
* {*}Low calculation efficiency{*}: Relies on multiple arithmetic operations
(1 multiplication + 3 additions), which are slower than bitwise operations and
fail to utilize the 32-bit space of {{int}} efficiently.
{code:java}
// hash collision cases
@Test
void testHashCodeCollision() {
// case A:hash=99
CharRange a1 = CharRange.isNotIn((char)1, (char)2); // 1,2,true →
83+1+14+1=99
CharRange a2 = CharRange.isIn((char)2, (char)2); // 2,2,false →
83+2+14+0=99
assertEquals(a1.hashCode(), a2.hashCode()); // Collision
// case B:hash=123
CharRange b1 = CharRange.isIn((char)5, (char)5); //5,5,false
→83+5+35+0=123
CharRange b2 = CharRange.isNotIn((char)4, (char)5); //4,5,true
→83+4+35+1=123
assertEquals(b1.hashCode(),b2.hashCode()); //Collision
} {code}
*3. Proposed Solution*
Replace the current linear combination implementation with a bitwise splicing +
XOR flag scheme, which maximizes the use of 32-bit int space, minimizes
collision rate, and optimizes calculation efficiency:[Github
PR|https://github.com/apache/commons-lang/pull/1524] [~ggregory]
was:
*1. Background*
The CharRange class (package: org.apache.commons.lang3) is a core component of
the CharSet utility, and its hashCode() method is frequently invoked in
scenarios such as HashMap/HashSet storage. The current implementation of
hashCode() (linear combination) has severe hash collision problems and
suboptimal calculation efficiency, which affects the performance of upper-layer
applications.
{code:java}
//Current hashCode implementation
@Override
public int hashCode() {
return 83 + start + 7 * end + (negated ? 1 : 0);
}{code}
h5. *2. Problem Description*
The current {{hashCode()}} implementation has two critical issues:
* {*}High collision rate{*}: The linear combination ({{{}83 + start + 7 * end
+ flag{}}}) easily causes hash value offset. For example:
** {{CharRange.isNotIn((char) 1, (char) 2)}} (start=1, end=2, negated=true) →
hash = 83+1+14+1=99
** {{CharRange.isIn((char) 2, (char) 2)}} (start=2, end=2, negated=false) →
hash = 83+2+14=99
These two logically distinct instances (equals() returns false) have the same
hash code, leading to serious bucket conflicts in {{{}HashMap{}}}.
* {*}Low calculation efficiency{*}: Relies on multiple arithmetic operations
(1 multiplication + 3 additions), which are slower than bitwise operations and
fail to utilize the 32-bit space of {{int}} efficiently.
{code:java}
// hash collision cases
@Test
void testHashCodeCollision() {
// case A:hash=99
CharRange a1 = CharRange.isNotIn((char)1, (char)2); // 1,2,true →
83+1+14+1=99
CharRange a2 = CharRange.isIn((char)2, (char)2); // 2,2,false →
83+2+14+0=99
assertEquals(a1.hashCode(), a2.hashCode()); // Collision
// case B:hash=123
CharRange b1 = CharRange.isIn((char)5, (char)5); //5,5,false
→83+5+35+0=123
CharRange b2 = CharRange.isNotIn((char)4, (char)5); //4,5,true
→83+4+35+1=123
assertEquals(b1.hashCode(),b2.hashCode()); //Collision
} {code}
*3. Proposed Solution*
Replace the current linear combination implementation with a bitwise splicing +
XOR flag scheme, which maximizes the use of 32-bit int space, minimizes
collision rate, and optimizes calculation efficiency:
[github PR|[https://github.com/apache/commons-lang/pull/1524]
|https://github.com/apache/commons-lang/pull/1524]] [~ggregory]
> [Improvement] Optimize CharRange.hashCode() to reduce collision rate and
> improve calculation efficiency
> -------------------------------------------------------------------------------------------------------
>
> Key: LANG-1802
> URL: https://issues.apache.org/jira/browse/LANG-1802
> Project: Commons Lang
> Issue Type: Improvement
> Components: lang.*
> Affects Versions: 3.20.0
> Reporter: Zhongxin Yan
> Priority: Major
>
> *1. Background*
> The CharRange class (package: org.apache.commons.lang3) is a core component
> of the CharSet utility, and its hashCode() method is frequently invoked in
> scenarios such as HashMap/HashSet storage. The current implementation of
> hashCode() (linear combination) has severe hash collision problems and
> suboptimal calculation efficiency, which affects the performance of
> upper-layer applications.
> {code:java}
> //Current hashCode implementation
> @Override
> public int hashCode() {
> return 83 + start + 7 * end + (negated ? 1 : 0);
> }{code}
> h5. *2. Problem Description*
> The current {{hashCode()}} implementation has two critical issues:
> * {*}High collision rate{*}: The linear combination ({{{}83 + start + 7 *
> end + flag{}}}) easily causes hash value offset. For example:
> ** {{CharRange.isNotIn((char) 1, (char) 2)}} (start=1, end=2, negated=true)
> → hash = 83+1+14+1=99
> ** {{CharRange.isIn((char) 2, (char) 2)}} (start=2, end=2, negated=false) →
> hash = 83+2+14=99
> These two logically distinct instances (equals() returns false) have the
> same hash code, leading to serious bucket conflicts in {{{}HashMap{}}}.
> * {*}Low calculation efficiency{*}: Relies on multiple arithmetic operations
> (1 multiplication + 3 additions), which are slower than bitwise operations
> and fail to utilize the 32-bit space of {{int}} efficiently.
> {code:java}
> // hash collision cases
> @Test
> void testHashCodeCollision() {
> // case A:hash=99
> CharRange a1 = CharRange.isNotIn((char)1, (char)2); // 1,2,true →
> 83+1+14+1=99
> CharRange a2 = CharRange.isIn((char)2, (char)2); // 2,2,false →
> 83+2+14+0=99
> assertEquals(a1.hashCode(), a2.hashCode()); // Collision
> // case B:hash=123
> CharRange b1 = CharRange.isIn((char)5, (char)5); //5,5,false
> →83+5+35+0=123
> CharRange b2 = CharRange.isNotIn((char)4, (char)5); //4,5,true
> →83+4+35+1=123
> assertEquals(b1.hashCode(),b2.hashCode()); //Collision
> } {code}
> *3. Proposed Solution*
> Replace the current linear combination implementation with a bitwise splicing
> + XOR flag scheme, which maximizes the use of 32-bit int space, minimizes
> collision rate, and optimizes calculation efficiency:[Github
> PR|https://github.com/apache/commons-lang/pull/1524] [~ggregory]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)