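Answering my own question for future readers' sake: Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS does the job. Note that this block covers the Han ideographs shared by Chinese, Japanese and Korean; kana and Hangul live in other blocks.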
Example:

import org.junit.Assert;
import org.junit.Test;

public class DetectCJK {

    @Test
    public void test1() {
        Assert.assertEquals(Character.UnicodeBlock.BASIC_LATIN, Character.UnicodeBlock.of('a'));
        Assert.assertEquals(Character.UnicodeBlock.HEBREW, Character.UnicodeBlock.of('א'));
        Assert.assertEquals("Traditional Chinese: Electricity",
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS, Character.UnicodeBlock.of('電'));
        Assert.assertEquals("Simplified Chinese: Electricity",
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS, Character.UnicodeBlock.of('电'));
        // Japanese writes "electricity" with the same code point as Traditional Chinese.
        Assert.assertEquals("Japanese: Electricity",
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS, Character.UnicodeBlock.of('電'));

        String chineseWritingStr = "漢字/汉字";
        // Walk the string one code point at a time (not one char at a time).
        for (int i = 0; i < chineseWritingStr.length(); ) {
            int codePoint = chineseWritingStr.codePointAt(i);
            i += Character.charCount(codePoint);
            if (codePoint == '/') {
                continue; // the separator is BASIC_LATIN, not CJK
            }
            Assert.assertEquals("Chinese: Chinese writing",
                    Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS, Character.UnicodeBlock.of(codePoint));
        }
    }
}

On Fri, Feb 22, 2013 at 12:51 AM, Gili Nachum <gilinac...@gmail.com> wrote:

> Hello, Is there anything in the Lucene core/contrib that could help detect
> if a keyword is CJK or not?
> I was thinking that an okay heuristic might be to inspect if the keyword's
> characters unicode value is within CJK ranges. Anything that does that?
>
> I'm seeing really bad performance when users query for keywords with a
> wildcard (say: "abc*") . Therefore, as a defensive measure, I plan to
> restrict wildcard queries to have a minimum of 4 characters (e.g., reject
> "abc*" allow "abcd*").
> However, for CJK keywords, I would like to make an exception, since in CJK
> just one or two letters stand for a distinct word (I'm okay that some CJK
> characters are not words, but are phonetic in nature).
>
> Thanks.
> Gili.
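For completeness, here is a rough sketch of how the wildcard restriction from the quoted question might use this check. It is not something Lucene provides: the names WildcardGuard, CJK_BLOCKS, containsCjk and allowWildcard are made up for illustration, and the choice of which Unicode blocks count as "CJK" is an assumption you would want to adjust for your own data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene or the JDK.
class WildcardGuard {

    // Assumption: these blocks are treated as "CJK" for the purpose of the
    // exception. The list is illustrative, not exhaustive.
    private static final Set<Character.UnicodeBlock> CJK_BLOCKS = new HashSet<>(Arrays.asList(
            Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
            Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A,
            Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B,
            Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS,
            Character.UnicodeBlock.HIRAGANA,
            Character.UnicodeBlock.KATAKANA,
            Character.UnicodeBlock.HANGUL_SYLLABLES));

    // Returns true if at least one code point of s falls in one of CJK_BLOCKS.
    static boolean containsCjk(String s) {
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            // UnicodeBlock.of returns null for unassigned code points;
            // HashSet.contains(null) simply returns false in that case.
            if (CJK_BLOCKS.contains(Character.UnicodeBlock.of(codePoint))) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }

    // prefix is the wildcard term without the trailing '*'.
    // CJK prefixes are always allowed; others need at least minLength code points.
    static boolean allowWildcard(String prefix, int minLength) {
        return containsCjk(prefix)
                || prefix.codePointCount(0, prefix.length()) >= minLength;
    }
}
```

With a minimum length of 4, allowWildcard("abc", 4) rejects the query "abc*", allowWildcard("abcd", 4) accepts "abcd*", and a single ideograph such as "電" is accepted regardless of length.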