Hi Mary,
I have investigated and found that currency symbols (Unicode class S) are not
indexed:
$, ¢, £, ¥
I also find that some punctuation characters (Unicode class P) are indexed and
are returned in the search.
Examples:
❫ MEDIUM FLATTENED RIGHT PARENTHESIS ORNAMENT U+276B
❵ MEDIUM RIGHT CURLY BRACKET ORNAMENT U+2775
Some characters, such as % and #, are not searchable.
I looked into the Unicode character classes and found the link below:
http://www.fileformat.info/info/unicode/category/index.htm
Could you confirm that, on the page above, the codes starting with P (Pc, Pd,
Pe, Pf, etc.) are all punctuation characters (Unicode class P) and the codes
starting with S (Sc, Sk, Sm) are all symbols?
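For reference, the general category of a character can also be checked locally.
The following is only a minimal sketch, assuming Python's standard unicodedata
module; it reports the classification from the Unicode Character Database,
which may not match how MarkLogic's tokenizer classifies a character. The
helper names major_class and describe are made up for illustration.

```python
import unicodedata

# Characters mentioned in this thread (the two ornaments are U+276B and U+2775).
SAMPLES = ["$", "¢", "£", "¥", "\u276b", "\u2775", "%", "#"]


def major_class(ch):
    """Return the first letter of the general category, e.g. 'P' or 'S'."""
    return unicodedata.category(ch)[0]


def describe(ch):
    """Format one character as: code point, two-letter category, Unicode name."""
    code_point = "U+{:04X}".format(ord(ch))
    name = unicodedata.name(ch, "<unnamed>")
    return "{} {} {}".format(code_point, unicodedata.category(ch), name)


for ch in SAMPLES:
    print(describe(ch))
```

The two-letter codes printed here are the same codes listed on the page above:
the first letter is the major class (P for punctuation, S for symbol) and the
second letter is the subclass.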
To help resolve my issue, could you also please reply with all the Unicode
class S characters that are indexed by MarkLogic?
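As a starting point for comparison, the full set of class S characters can be
enumerated from the Unicode data itself. This is a rough sketch, again assuming
Python's unicodedata module; it lists what Unicode places in class S, not what
MarkLogic actually indexes, and the result depends on the Unicode version that
ships with the Python build. The function name chars_in_class is hypothetical.

```python
import sys
import unicodedata


def chars_in_class(major):
    """Return every code point whose general category starts with `major`."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith(major)
    ]


symbols = chars_in_class("S")
print(len(symbols), "code points in Unicode class S")
# Show the first few, in code point order.
print(" ".join(symbols[:20]))
```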
Thanks,
Rahul
-----Original Message-----
From: Mary Holstege [mailto:[email protected]]
Sent: Wednesday, November 18, 2015 10:35 PM
To: [email protected]; Kataram, Rahul (Cognizant)
Subject: Re: [MarkLogic Dev General] Issue with special / foreign language
characters in ML rest search
MarkLogic doesn't index punctuation characters (Unicode class P) except for
"exact" value queries.
Therefore a word query or a value query that does not have the "exact"
option cannot be resolved precisely by the index, only by the filter. So the
index returns false positives and if you want precise answers you need to use
filtered search. "punctuation sensitive" by itself doesn't change this dynamic.
© is a punctuation character, which is why you see precise answers only in
filtered searches (and at great cost).
√, on the other hand, is not a punctuation character but a symbol (Unicode
class S), and symbols are indexed as words in their own right, which is why you
see precise answers even in unfiltered searches.
If you are doing searches within the scope of particular elements, you could
set up a field with a tokenizer override to reclassify certain punctuation
characters (such as ©) as symbols instead. This only applies in the context of
the field, but you can then do a field-word-query or field-value-query that
would be accurate out of the index.
By the way, the Uniview application (http://r12a.github.io/uniview) is a handy
place to look up particular characters and see what their Unicode
classification is.
//Mary