: Have you managed to get the regex for this string in Chinese: 预支款管理及账务处理办法 ?
...
: > An example of the string in Chinese is 预支款管理及账务处理办法
: >
: > The number of characters is 12, but the expected length should be 36.
...
: >> > So this would likely be different from what the operating system counts, as
: >> > the operating system may consider each Chinese character as 3 to 4 bytes.
: >> > Which is probably why I could not find any record with subject:/.{255,}.*/

Java regexes operate on Unicode strings, so "." matches any *character*. There is no regex syntax to match any "byte", so a regex-based approach is never going to be viable.

Your best bet is to check the byte count when indexing -- but even then you'd need some custom code, since things like FieldLengthUpdateProcessorFactory are well behaved and count the *characters* of the Unicode strings.

If you absolutely can't reindex, then you'd need a custom QParser that produced a custom Query object that iterated over the TermEnum looking at the buffers and counting the bytes in each term -- matching each doc associated with those terms.

-Hoss
http://www.lucidworks.com/
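
For illustration, here is a minimal sketch of the character versus byte distinction, using the example string from the quoted message. It assumes the length you care about is the UTF-8 encoded size; the class name ByteLengthDemo and its helper methods are hypothetical, and the code is plain Java rather than anything Solr-specific. Where you would call such a check (for example in your own indexing code) is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr or Lucene.
public class ByteLengthDemo {

    // Number of Unicode characters (code points) in the string.
    static int charCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the string occupies when encoded as UTF-8 (an assumption
    // about which encoding the byte limit refers to).
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the UTF-8 encoding of the string is longer than maxBytes.
    static boolean exceedsByteLimit(String s, int maxBytes) {
        return utf8Length(s) > maxBytes;
    }

    public static void main(String[] args) {
        String subject = "预支款管理及账务处理办法";

        System.out.println("characters: " + charCount(subject));
        System.out.println("UTF-8 bytes: " + utf8Length(subject));

        // The regex quantifier counts characters, not bytes.
        System.out.println(".{12} matches: " + Pattern.matches(".{12}", subject));
        System.out.println(".{36} matches: " + Pattern.matches(".{36}", subject));
    }
}
```

The regex only ever sees the 12 characters, which is why a pattern like /.{255,}.*/ cannot find values that are long only in terms of bytes; a check along the lines of exceedsByteLimit has to run on the encoded bytes instead.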