Using fields won't be an option for our use case, but arranging things to use
value queries may be.

Is it possible to re-classify these characters as symbols or words, without 
using field tokenizer overrides? For example, by modifying the tokenizer.xml 
file?

Wissam

-----Original Message-----
From: general-boun...@developer.marklogic.com 
[mailto:general-boun...@developer.marklogic.com] On Behalf Of Mary Holstege
Sent: 29 June 2016 17:42
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] word-query including punctuation characters

On Wed, 29 Jun 2016 08:06:35 -0700, Wissam Asfahani (TSO GB) 
<wissam.asfah...@tso.co.uk> wrote:

> Good afternoon,
>
> We are having some issues estimating the number of documents when
> performing word queries containing punctuation characters.
>
> I have attached 4 sample documents. When using the below query, the
> estimate returns 3 and the count 1.
>
> Are there any db configuration settings we can use to ensure a more
> accurate estimate result?
>
>
> let $query := cts:word-query("4µ", ("exact"), 2)
>
> return
>   (
>     xdmp:estimate(cts:search(fn:doc(), $query)),
>     fn:count(cts:search(fn:doc(), $query))
>   )
>
>
> Wissam Asfahani
> XML Developer
>

Punctuation is not indexed in the word query indexes. An exact unwildcarded
*value* query will consider punctuation, so if you can arrange things so that
you can use a value query, that could be a solution. If it is just this
character and searching for it in this way is confined to identifiable parts of
the document, you could use field tokenizer overrides to redefine µ as a word
or symbol character for that field. But it looks like it is being classified
as a punctuation mark in error: it should be classified as a letter character
anyway since it is listed as Ll in the Unicode tables.
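
[Editor's sketch, not part of the original message.] The value-query
suggestion above could look roughly like the following. The element name
"measure" is hypothetical, and the query assumes the element's whole text
content is exactly "4µ", since a value query matches an entire value rather
than a word inside it. The fn:matches call is only a quick check of the
Unicode category mentioned above, using a standard regex category escape.

```xquery
xquery version "1.0-ml";

(: Hypothetical element name; assumes a document such as <measure>4µ</measure>.
   A value query matches the element's entire text content, not a word in it. :)
let $value-query :=
  cts:element-value-query(xs:QName("measure"), "4µ", ("exact"))

return
  (
    (: true if the regex engine places µ in Unicode category Ll :)
    fn:matches("µ", "^\p{Ll}$"),
    (: unfiltered, index-only estimate :)
    xdmp:estimate(cts:search(fn:doc(), $value-query)),
    (: filtered result count, for comparison :)
    fn:count(cts:search(fn:doc(), $value-query))
  )
```

If the estimate and the count agree for this query, the value index is
handling the punctuation-sensitive match without relying on filtering.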

//Mary
_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general