[ 
https://issues.apache.org/jira/browse/CODEC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dianshu Liao updated CODEC-330:
-------------------------------
    Description: 
Method: org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String 
input)

 
h1. Problem

 

The private method {{cleanup(final String input)}} in {{DaitchMokotoffSoundex}} 
is responsible for sanitizing the input string before the phonetic encoding is 
applied. While it correctly removes whitespace and performs ASCII folding, it 
does *not* remove non-letter special characters such as {{$}}, {{@}}, 
{{#}}, {{!}}, or digits. These characters remain in the cleaned string.

As a result, special characters may interfere with phonetic rule matching in 
downstream methods like {{soundex}} and {{encode}}, potentially 
leading to incorrect or inconsistent results.

For example, cleanup("Hello$World") -> "hello$world"

The dollar sign ({{$}}) should have been removed, but it remains in the 
result.

The expected result should be "helloworld"

 

 
h1. Suggested Fix

 

Modify the {{cleanup()}} method to include a check for non-letter characters:

if (!Character.isLetter(ch)) {
    continue; // Ignore non-letter characters like $, @, -, etc.
}

This small change will make the method more robust when processing real-world 
input strings that may contain unexpected non-letter characters.
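
The following is a minimal standalone sketch of how the suggested check could behave. It is not the library's source: the class name {{CleanupSketch}} and the two-entry {{FOLDINGS}} table are hypothetical stand-ins chosen for illustration, and the real method may differ in detail. Because whitespace is not a letter, the single {{isLetter}} check in this sketch also covers whitespace removal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the Commons Codec implementation.
final class CleanupSketch {

    // Assumed stand-in for the encoder's folding table (two sample entries).
    private static final Map<Character, Character> FOLDINGS = new HashMap<>();
    static {
        FOLDINGS.put('\u00e9', 'e'); // e with acute accent -> e
        FOLDINGS.put('\u00f6', 'o'); // o with diaeresis -> o
    }

    static String cleanup(final String input) {
        final StringBuilder sb = new StringBuilder();
        for (char ch : input.toCharArray()) {
            // Suggested check: skip anything that is not a letter
            // (punctuation, digits, symbols, and whitespace).
            if (!Character.isLetter(ch)) {
                continue;
            }
            ch = Character.toLowerCase(ch);
            // Fold accented letters using the assumed table above.
            final Character folded = FOLDINGS.get(ch);
            if (folded != null) {
                ch = folded;
            }
            sb.append(ch);
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(cleanup("Hello$World"));  // prints helloworld
        System.out.println(cleanup("O'Brien 3rd!")); // prints obrienrd
    }
}
```

Note that {{Character.isLetter}} accepts letters from any script, so non-Latin letters would still pass through this check unchanged.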

 

 
h1. Additional Context

 

This issue was identified during unit testing using JUnit 5. After applying the 
above fix, all test cases involving inputs with special characters pass 
successfully. Without the fix, inputs containing unexpected symbols produce 
inconsistent results.

  was:
Method: org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String 
input)

 
h1. Problem

 

The private method {{cleanup(final String input)}} in {{DaitchMokotoffSoundex}} 
is responsible for sanitizing the input string before the phonetic encoding is 
applied. While it correctly removes whitespace and performs ASCII folding, it 
does *not* remove non-letter special characters such as {{$}}, {{@}}, 
{{#}}, {{!}}, or digits. These characters remain in the cleaned string.

As a result, special characters may interfere with phonetic rule matching in 
downstream methods like {{soundex}} and {{encode}}, potentially 
leading to incorrect or inconsistent results.

For example, cleanup("Hello$World") -> "hello$world"

The dollar sign ({{$}}) should have been removed, but it remains in the 
result.

The expected result should be "helloworld"

 

 

Suggested Fix

Modify the {{cleanup()}} method to include a check for non-letter characters:

if (!Character.isLetter(ch)) {
    continue; // Ignore non-letter characters like $, @, digits, etc.
}


> org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String) does 
> not remove special characters (e.g., punctuation)
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CODEC-330
>                 URL: https://issues.apache.org/jira/browse/CODEC-330
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>         Environment: JDK 8, MacOS
>            Reporter: Dianshu Liao
>            Priority: Major
>         Attachments: Screenshot 2025-05-19 at 1.01.11 am.png
>
>
> Method: 
> org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String input)
>  
> h1. Problem
>  
> The private method {{cleanup(final String input)}} in 
> {{DaitchMokotoffSoundex}} is responsible for sanitizing the input string 
> before the phonetic encoding is applied. While it correctly removes 
> whitespace and performs ASCII folding, it does *not* remove non-letter 
> special characters such as {{$}}, {{@}}, {{#}}, {{!}}, or 
> digits. These characters remain in the cleaned string.
> As a result, special characters may interfere with phonetic rule matching in 
> downstream methods like {{soundex}} and {{encode}}, potentially 
> leading to incorrect or inconsistent results.
> For example, cleanup("Hello$World") -> "hello$world"
> The dollar sign ({{$}}) should have been removed, but it remains in the 
> result.
> The expected result should be "helloworld"
>  
>  
> h1. Suggested Fix
>  
> Modify the {{cleanup()}} method to include a check for non-letter characters:
> if (!Character.isLetter(ch)) {
>     continue; // Ignore non-letter characters like $, @, -, etc.
> }
> This small change will make the method more robust when processing real-world 
> input strings that may contain unexpected non-letter characters.
>  
>  
> h1. Additional Context
>  
> This issue was identified during unit testing using JUnit 5. After applying 
> the above fix, all test cases involving inputs with special characters pass 
> successfully. Without the fix, inputs containing unexpected symbols produce 
> inconsistent results.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
