Re: [MarkLogic Dev General] Issue with special / foreign language characters in ML rest search

Rahul.Kataram Wed, 02 Dec 2015 20:45:08 -0800

Hi Mary,

I have investigated and found that currency symbols (Unicode class S) are not 
indexed:
$, ¢, £, ¥


I also find that some punctuation characters (Unicode class P) are indexed and 
are returned in the search.
Examples: ❫    MEDIUM FLATTENED RIGHT PARENTHESIS ORNAMENT U+276B
          ❵    MEDIUM RIGHT CURLY BRACKET ORNAMENT U+2775
Other characters, such as % and #, are not searchable.

I looked into the Unicode character classes and found the link below.
http://www.fileformat.info/info/unicode/category/index.htm
Could you confirm that, on that page, the codes starting with P (Pc, Pd, Pe, 
Pf, etc.) are all punctuation characters (Unicode class P) and the codes 
starting with S (Sc, Sk, Sm) are all symbols?
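
For reference, the two-letter codes on that page are Unicode general 
categories, and the first letter is the major class (P for punctuation, S for 
symbol). A minimal sketch for checking the characters mentioned here, using 
Python's standard unicodedata module. The helper name major_class is made up 
for this example, and the result reflects the Unicode tables bundled with the 
Python interpreter, which may not match how MarkLogic's tokenizer classifies a 
given character.

```python
import unicodedata

def major_class(ch):
    """Return (general category, major class letter), e.g. ('Sc', 'S')."""
    cat = unicodedata.category(ch)
    return cat, cat[0]

# Characters taken from this thread; print code point, category and name.
for ch in "$¢£¥%#❫❵√":
    cat, major = major_class(ch)
    kind = {"P": "punctuation", "S": "symbol"}.get(major, "other")
    print(f"U+{ord(ch):04X} {ch} {cat} ({kind}) {unicodedata.name(ch)}")
```

Per the Unicode tables, the currency signs come out as Sc, the two bracket 
ornaments as Pe, and % and # as Po.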

To help resolve my issue, could you also please reply with all the Unicode 
class S characters that are indexed by MarkLogic?

thanks
Rahul
-----Original Message-----
From: Mary Holstege [mailto:[email protected]]
Sent: Wednesday, November 18, 2015 10:35 PM
To: [email protected]; Kataram, Rahul (Cognizant)
Subject: Re: [MarkLogic Dev General] Issue with special / foreign language 
characters in ML rest search


MarkLogic doesn't index punctuation characters (Unicode class P) except for 
"exact" value queries.

Therefore a word query or a value query that does not have the "exact"
option cannot be resolved precisely by the index, only by the filter. So the 
index returns false positives and if you want precise answers you need to use 
filtered search. "punctuation sensitive" by itself doesn't change this dynamic.
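
The difference between index resolution and filtering can be illustrated with 
a toy model. This is only a sketch in Python under simplifying assumptions, 
not MarkLogic's implementation: the function names, the sample documents, and 
the substring check used as the "filter" are all invented for the example. The 
toy index keeps words and symbols (class S) as tokens and drops punctuation 
(class P), so a query containing punctuation yields false positives until the 
candidates are re-checked against the stored text.

```python
import unicodedata

def index_tokens(text):
    """Tokenize for the toy index: keep words and symbols, drop punctuation."""
    tokens, word = [], []
    for ch in text:
        major = unicodedata.category(ch)[0]
        if major in ("L", "N"):        # letters and digits extend the current word
            word.append(ch)
            continue
        if word:                       # any other character ends the current word
            tokens.append("".join(word).lower())
            word = []
        if major == "S":               # a symbol becomes a token in its own right
            tokens.append(ch)
        # punctuation (P*) and whitespace are not indexed in this toy model
    if word:
        tokens.append("".join(word).lower())
    return tokens

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for tok in index_tokens(text):
            index.setdefault(tok, set()).add(doc_id)
    return index

def unfiltered_search(index, query):
    """Candidates from the index alone: all indexed query tokens must match."""
    toks = index_tokens(query)
    if not toks:
        return set()
    return set.intersection(*(index.get(t, set()) for t in toks))

def filtered_search(index, docs, query):
    """Re-check each candidate against its text to discard false positives."""
    candidates = unfiltered_search(index, query)
    return {d for d in candidates if query.lower() in docs[d].lower()}

# Hypothetical sample documents.
docs = {
    "a": "discount of 50% today",
    "b": "room 50 is free",
    "c": "area is √2 units",
    "d": "area is 2 units",
}
index = build_index(docs)

print(sorted(unfiltered_search(index, "50%")))      # ['a', 'b'] ('%' is not indexed)
print(sorted(filtered_search(index, docs, "50%")))  # ['a']
print(sorted(unfiltered_search(index, "√")))        # ['c'] (symbol is indexed)
```

The filter step has to read every candidate document, which is where the extra 
cost of filtered search comes from in this model.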

© is a punctuation character, which is why you see precise answers only in 
filtered searches (and at great cost).

√, on the other hand, is not a punctuation character but a symbol (Unicode 
class S), and symbols are indexed as words in their own right, which is why you 
see precise answers even in unfiltered searches.

If you are doing searches within the scope of particular elements, you could 
set up a field with a tokenizer override to reclassify certain punctuation 
characters (such as ©) as symbols instead. This only applies in the context of 
the field, but you can then do a field-word-query or field-value-query that 
would be accurate out of the index.

By the way, the Uniview application (http://r12a.github.io/uniview) is a handy 
place to look up particular characters and see what their Unicode 
classification is.

//Mary
