[ https://issues.apache.org/jira/browse/CODEC-330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dianshu Liao updated CODEC-330:
-------------------------------
    Description:
Method: org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String input)

h1. Problem

The private method {{cleanup(final String input)}} in {{DaitchMokotoffSoundex}} is responsible for sanitizing the input string before the phonetic encoding is applied. While it correctly removes whitespace and performs ASCII folding, it does *not* remove non-letter special characters such as {{$}}, {{@}}, {{#}}, {{!}}, or digits. These characters remain in the cleaned string.

As a result, special characters may interfere with phonetic rule matching in downstream methods like {{soundex}} and {{encode}}, potentially leading to incorrect or inconsistent results.

For example, cleanup("Hello$World") -> "hello$world"
The dollar sign ({{$}}) should have been removed, but it remains in the result.
The expected result should be "helloworld"

h1. Suggested Fix

Modify the {{cleanup()}} method to include a check for non-letter characters:

    if (!Character.isLetter(ch)) {
        continue; // Ignore non-letter characters like $, @, -, etc.
    }

This small change will make the method more robust when processing real-world input strings that may contain unexpected non-letter characters.

h1. Additional Context

This issue was identified during unit testing using JUnit 5. After applying the above fix, all test cases involving inputs with special characters pass successfully. Without the fix, inputs containing unexpected symbols produce inconsistent results.

  was:
Method: org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String input)

h1. Problem

The private method {{cleanup(final String input)}} in {{DaitchMokotoffSoundex}} is responsible for sanitizing the input string before the phonetic encoding is applied.
While it correctly removes whitespace and performs ASCII folding, it does *not* remove non-letter special characters such as {{$}}, {{@}}, {{#}}, {{!}}, or digits. These characters remain in the cleaned string.

As a result, special characters may interfere with phonetic rule matching in downstream methods like {{soundex}} and {{encode}}, potentially leading to incorrect or inconsistent results.

For example, cleanup("Hello$World") -> "hello$world"
The dollar sign ({{$}}) should have been removed, but it remains in the result.
The expected result should be "helloworld"

Suggested Fix

Modify the {{cleanup()}} method to include a check for non-letter characters:

    if (!Character.isLetter(ch)) {
        continue; // Ignore non-letter characters like $, @, digits, etc.
    }


> org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String) does
> not remove special characters (e.g., punctuation)
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CODEC-330
>                 URL: https://issues.apache.org/jira/browse/CODEC-330
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>         Environment: JDK 8, MacOS
>            Reporter: Dianshu Liao
>            Priority: Major
>         Attachments: Screenshot 2025-05-19 at 1.01.11 am.png
>
>
> Method:
> org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String input)
>
> h1. Problem
>
> The private method {{cleanup(final String input)}} in
> {{DaitchMokotoffSoundex}} is responsible for sanitizing the input string
> before the phonetic encoding is applied. While it correctly removes
> whitespace and performs ASCII folding, it does *not* remove non-letter
> special characters such as {{$}}, {{@}}, {{#}}, {{!}}, or
> digits. These characters remain in the cleaned string.
> As a result, special characters may interfere with phonetic rule matching in
> downstream methods like {{soundex}} and {{encode}}, potentially
> leading to incorrect or inconsistent results.
> For example, cleanup("Hello$World") -> "hello$world"
> The dollar sign ({{$}}) should have been removed, but it remains in the
> result.
> The expected result should be "helloworld"
>
>
> h1. Suggested Fix
>
> Modify the {{cleanup()}} method to include a check for non-letter characters:
> if (!Character.isLetter(ch)) {
>     continue; // Ignore non-letter characters like $, @, -, etc.
> }
> This small change will make the method more robust when processing real-world
> input strings that may contain unexpected non-letter characters.
>
>
> h1. Additional Context
>
> This issue was identified during unit testing using JUnit 5. After applying
> the above fix, all test cases involving inputs with special characters pass
> successfully. Without the fix, inputs containing unexpected symbols produce
> inconsistent results.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
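
For reference, the check proposed under "Suggested Fix" can be sketched as a standalone method. This is a minimal illustration and not the Commons Codec source: the class name {{CleanupSketch}}, the two-entry {{FOLDINGS}} map, and the {{folding}} parameter are assumptions made up for this example. Only the {{Character.isLetter}} check and the "Hello$World" input come from the report.

```java
import java.util.HashMap;
import java.util.Map;

public final class CleanupSketch {

    // Hypothetical stand-in for an ASCII folding table; the real table is not shown in this report.
    private static final Map<Character, Character> FOLDINGS = new HashMap<>();

    static {
        FOLDINGS.put('\u00e9', 'e'); // e with acute accent folds to plain e
        FOLDINGS.put('\u00f6', 'o'); // o with diaeresis folds to plain o
    }

    // Returns the input lower-cased, optionally folded, with every non-letter character dropped.
    public static String cleanup(final String input, final boolean folding) {
        final StringBuilder sb = new StringBuilder(input.length());
        for (char ch : input.toCharArray()) {
            // Proposed check: skips $, @, #, !, digits, and also whitespace,
            // since whitespace characters are not letters either.
            if (!Character.isLetter(ch)) {
                continue;
            }
            ch = Character.toLowerCase(ch);
            final Character folded = FOLDINGS.get(ch);
            if (folding && folded != null) {
                ch = folded;
            }
            sb.append(ch);
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(cleanup("Hello$World", true)); // prints helloworld
        System.out.println(cleanup("O'Brien 3rd", true)); // prints obrienrd
    }
}
```

Note that {{Character.isLetter}} accepts any Unicode letter, so accented or non-Latin letters still reach the folding step; whether those should also be filtered is a separate decision that this sketch does not make.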