From: Singh, Divya <divyasi...@siemens.com.INVALID>
Sent: 04 July 2025 14:40
To: d...@lucene.apache.org
Cc: Birajdar, Sharad (DI SW PLM LCS APPS ALM R&D7) <sharad.biraj...@siemens.com>
Subject: FW: Challenges with Chinese Query Matching and Wildcard Search in Lucene (StandardAnalyzer / CJKAnalyzer)

From: Thakare, Monika (ext) (DI SW PLM LCS APPS ALM R&D7) <monika.thakare....@siemens.com>
Sent: 04 July 2025 09:56
To: java-user@lucene.apache.org
Cc: Singh, Divya (DI SW PLM LCS APPS ALM R&D7) <divyasi...@siemens.com>
Subject: Challenges with Chinese Query Matching and Wildcard Search in Lucene (StandardAnalyzer / CJKAnalyzer)

Dear Team,

We're currently working with Lucene version 9.4.2, using the lucene-analysis-common-9.4.2.jar package, and have encountered inconsistencies while performing search queries on Chinese text, particularly full names like "黄朝辉". We've used Luke 9 to inspect the index and observed the following behavior:

Queries that return results:
* "黄", "朝", "辉"
* "黄*", "朝*", "辉*"

Queries that do not return results:
* "黄朝", "朝辉"
* "黄朝*", "朝辉*"
* Full match: "黄朝辉", "黄朝辉*"

It seems that compound tokens or full-name queries are not matching as expected, even with wildcards, despite successful indexing.

To explore alternatives, we attempted to use CJKAnalyzer from org.apache.lucene.analysis.cjk, but encountered an Eclipse restriction:

    Access restriction: The type 'CJKAnalyzer' is not API

We'd greatly appreciate your insight on:

1. Whether wildcard queries are supported for Chinese text using StandardAnalyzer or another analyzer.
2. How compound or full-name queries in Chinese are expected to behave, and whether specific tokenization issues might be involved.
3. The proper way to use CJKAnalyzer without encountering access restrictions, or alternative analyzers better suited for this use case.

Thank you for your time and any guidance you can provide. We're open to suggestions regarding analyzer choice, configuration, or best practices for Chinese query handling.

Thanks and Regards
Monika Thakare
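
For reference, the pattern reported above is consistent with per-character tokenization. The following is a minimal sketch in plain Java, not Lucene code: it assumes that StandardAnalyzer emits one token per Han character, that CJKAnalyzer emits overlapping two-character tokens by default, and that the failing queries reach the index as a single unanalyzed term or prefix. The class and method names (TokenSketch, unigrams, bigrams, termMatches, prefixMatches) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: mimics two tokenization strategies without using Lucene.
class TokenSketch {

    // One token per character (assumed to mirror StandardAnalyzer on Han text).
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    // Overlapping two-character tokens (assumed to mirror CJKAnalyzer defaults).
    // A lone character is kept as a single token.
    static List<String> bigrams(String text) {
        List<String> chars = unigrams(text);
        List<String> out = new ArrayList<>();
        if (chars.size() == 1) {
            out.add(chars.get(0));
        }
        for (int i = 0; i + 1 < chars.size(); i++) {
            out.add(chars.get(i) + chars.get(i + 1));
        }
        return out;
    }

    // Exact lookup of one term among the indexed tokens.
    static boolean termMatches(List<String> indexed, String term) {
        return indexed.contains(term);
    }

    // Prefix wildcard ("abc*") evaluated against individual indexed tokens.
    static boolean prefixMatches(List<String> indexed, String prefix) {
        return indexed.stream().anyMatch(t -> t.startsWith(prefix));
    }

    public static void main(String[] args) {
        String name = "黄朝辉";
        List<String> uni = unigrams(name);
        List<String> bi = bigrams(name);
        String[] queries = {"黄", "黄朝", "朝辉", "黄朝辉"};
        for (String q : queries) {
            System.out.println(q
                + " | unigram index: term=" + termMatches(uni, q)
                + " prefix=" + prefixMatches(uni, q)
                + " | bigram index: term=" + termMatches(bi, q)
                + " prefix=" + prefixMatches(bi, q));
        }
    }
}
```

Under these assumptions, a single-character term or prefix finds a token in the per-character index, while any multi-character term or prefix cannot, because no indexed token is longer than one character. With two-character tokens, "黄朝" and "朝辉" would match as terms, but the three-character name would still need to be analyzed into a phrase rather than looked up as one term.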