: Have you managed to get the regex for this string in Chinese: 预支款管理及账务处理办法 ?
        ...
: > An example of the string in Chinese is 预支款管理及账务处理办法
: >
: > The number of characters is 12, but the expected length should be 36.
        ...
: >> > So this would likely be different from what the operating system counts, as
: >> > the operating system may consider each Chinese character as 3 to 4 bytes.
: >> > Which is probably why I could not find any record with subject:/.{255,}.*/

Java regexes operate on Unicode strings, so "." matches any *character*.
There is no regex syntax to match an arbitrary "byte", so a regex-based 
approach is never going to be viable.
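
To make the mismatch concrete, here is a small sketch using the example 
string from this thread. It uses java.util.regex purely for illustration 
(Solr's regexp queries go through Lucene's own regex engine, which I am 
assuming behaves the same way with respect to characters vs. bytes). The 
class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class CharVsByteDemo {
    // Number of Unicode characters (code points) in the string.
    static int charCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the whole string is at least n characters long, per the regex.
    static boolean matchesAtLeast(String s, int n) {
        return Pattern.compile(".{" + n + ",}").matcher(s).matches();
    }

    public static void main(String[] args) {
        String subject = "预支款管理及账务处理办法";
        System.out.println(charCount(subject));          // 12
        System.out.println(utf8ByteCount(subject));      // 36
        System.out.println(matchesAtLeast(subject, 12)); // true
        System.out.println(matchesAtLeast(subject, 13)); // false
    }
}
```

The regex quantifier tracks the 12 characters, not the 36 bytes, which is 
why a pattern like .{255,} never sees the byte length.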

Your best bet is to check the byte count when indexing -- but even then 
you'd need some custom code, since things like 
FieldLengthUpdateProcessorFactory are well behaved and count the 
*characters* of the Unicode strings.
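
A rough sketch of what that custom code might compute, written as plain 
Java with no Solr classes involved. The field name "subject_bytes" and the 
class name are hypothetical, and wiring this into an update processor or 
into your indexing client is left out. It assumes the bytes you care about 
are UTF-8.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ByteLengthField {
    // UTF-8 length of the text, computed per code point without allocating
    // a byte array. (An unpaired surrogate is counted as 3 bytes here.)
    static int utf8Length(CharSequence s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            if (cp < 0x80) {
                bytes += 1;
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp < 0x10000) {
                bytes += 3;
            } else {
                bytes += 4;
            }
            i += Character.charCount(cp);
        }
        return bytes;
    }

    // Builds the field values for one document, adding the byte length
    // under a separate (hypothetical) integer field.
    static Map<String, Object> withByteLength(String subject) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("subject", subject);
        doc.put("subject_bytes", utf8Length(subject));
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(withByteLength("预支款管理及账务处理办法"));
    }
}
```

With a numeric field like that in the index, a plain range query such as 
subject_bytes:[255 TO *] would find the long values without any regex.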

If you absolutely can't reindex, then you'd need a custom QParser that 
produced a custom Query object that iterated over the TermEnum looking at 
the buffers and counting the bytes in each term -- matching each doc 
associated with those terms.



-Hoss
http://www.lucidworks.com/
