Re: [MarkLogic Dev General] Issue with special / foreign language characters in ML rest search

Mary Holstege Wed, 18 Nov 2015 09:06:42 -0800

MarkLogic doesn't index punctuation characters (Unicode class P) except  
for "exact" value queries.


Therefore a word query or a value query that does not have the "exact"  
option cannot be resolved precisely by the index, only by the filter. So  
the index returns false positives and if you want precise answers you need  
to use filtered search. "punctuation sensitive" by itself doesn't change  
this dynamic.

© is a punctuation character, which is why you see precise answers only in
filtered searches (and at great cost).

√, on the other hand, is not a punctuation character but a symbol (Unicode
class S), and symbols are indexed as words in their own right, which is why
you see precise answers even in unfiltered searches.

If you are doing searches within the scope of particular elements, you  
could set up a field with a tokenizer override to reclassify certain  
punctuation characters (such as ©) as symbols instead. This only applies  
in the context of the field, but you can then do a field-word-query or  
field-value-query that would be accurate out of the index.

By the way, the Uniview application (http://r12a.github.io/uniview) is a
handy place to look up particular characters and see what their Unicode
classification is.
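
If you would rather check a character from a script than from a web page,
the sketch below uses Python's standard unicodedata module to report the
Unicode general category. The helper names unicode_class and major_class
are made up for this illustration. Treat the result as a first check only:
it reports what the Unicode database says, and I am assuming (not
asserting) that this lines up with how the MarkLogic tokenizer classifies a
given character, so confirm against actual search behavior.

```python
import unicodedata


def unicode_class(ch: str) -> str:
    """Return the two-letter Unicode general category of one character.

    Hypothetical helper for illustration; it wraps unicodedata.category().
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return unicodedata.category(ch)


def major_class(ch: str) -> str:
    """Return only the major class letter, e.g. 'P' (punctuation) or 'S' (symbol)."""
    return unicode_class(ch)[0]


if __name__ == "__main__":
    # The two characters discussed in this thread, plus two ASCII marks.
    for ch in ("\u00a9", "\u221a", "!", "-"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
              f"{unicode_class(ch)} (major class {major_class(ch)})")
```

Running the file prints one line per character with its code point, name,
and category, which is the same information Uniview shows.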

//Mary
