[MarkLogic Dev General] RE: Whitespace, Punctuation, Collations,and& (Tim Meagher)

Kelly Stirman Fri, 26 Jun 2009 12:00:25 -0700

Tim,

You can include cts:query arguments in your call to cts:element-value-match(), 
including cts:directory-query() and cts:collection-query():

cts:element-value-match(
        $element-names as xs:QName*,
        $pattern as xs:anyAtomicType,
        [$options as xs:string*],
        [$query as cts:query?],
        [$quality-weight as xs:double?],
        [$forest-ids as xs:unsignedLong*]
)  as  xs:anyAtomicType*

I believe using cts:element-value-match() is the best way to accomplish what 
you are trying to do. Collations can help you "fold" variations in punctuation 
or whitespace (using the collation builder), but you should only do this if you 
really don't care about the differences. :-) I say this because in your two 
examples, both values would be included as a single entry in the index, and 
when you get the entry out using element-values() or element-value-match(), the 
first one that gets in is what you'll get back (there are some other subtleties 
here, but I won't go into them).
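
As a rough illustration of the call described above, the sketch below passes
a cts:and-query of a directory query and a collection query as the $query
argument. It assumes a string element range index on JournalTitle (the
element discussed in this thread); the directory URI "/journals/" and the
collection name "psychology" are made up for the example.

```xquery
xquery version "1.0-ml";

(: Sketch only: "/journals/" and "psychology" are hypothetical names.
   Requires a string element range index on JournalTitle. :)
cts:element-value-match(
  xs:QName("JournalTitle"),
  "Personality*",                      (: wildcard pattern matched against lexicon values :)
  ("ascending"),
  cts:and-query((
    cts:directory-query("/journals/", "infinity"),
    cts:collection-query("psychology")
  ))
)
```

Only values that occur in fragments selected by the query should come back,
which is one way to keep datasets that share an element name apart without
introducing namespaces.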

Kelly

Message: 2
Date: Fri, 26 Jun 2009 14:22:59 -0400
From: "Tim Meagher" <[email protected]>
Subject: RE: [MarkLogic Dev General] Whitespace, Punctuation,
        Collations,and&amp;
To: "'General Mark Logic Developer Discussion'"
        <[email protected]>
Cc: 'Paul Rooney' <[email protected]>
Message-ID: <005f01c9f68b$242b3bf0$0601a...@grace>
Content-Type: text/plain; charset="us-ascii"

Hi Mary,

Thank you for the explanation.  It sheds a lot of light on the restricted
use of sub-query options associated with lexicon searches due to the nature
of unfiltered searches.  Just to make sure I understand you correctly, I
want to rephrase what I'm hearing and ask if I have it right:

When using cts:element-value-query (or other search constructors such as
cts:element-word-query) as a sub-query of lexicon searches, always specify
the "punctuation-sensitive" and "whitespace-sensitive" options.  The
following options are allowable: "stemmed" and "unstemmed",
"case-insensitive" and "case-sensitive", "diacritic-insensitive" and
"diacritic-sensitive", and "wildcarded" and "unwildcarded".  (I do need
clarification about the expected results when using an asterisk directly
after a word, such as "Personality*" in a wildcarded search and if that is
supposed to create any problems.)

One of the reasons that I chose to use cts:element-values instead of
cts:element-value-match is that I was informed that I can use the
cts:and-query as a subquery and string together additional filters to narrow
down the lexicon search results.  For example, if my element range index
covers data in multiple directory URIs, then I should be able to add a
cts:directory-query to restrict the results to fragments in the appropriate
directory URI.  I hope to be able to do the same using collections as
filters.

The XML documents with which I'm working do not use namespaces.  Employing
namespaces would help to avoid conflicts between datasets in the same
database that share common element names, but I was told I can use directory
and collection filters in the lexicon search subquery to alleviate the need
for namespaces.  However, I'm more than ready to implement namespaces if I
find that there are certain limitations of not using them when it comes to
lexicon (including co-occurrence) searches.

The disadvantage of using cts:element-value-match is that subqueries cannot
be provided (as in cts:element-values) to further restrict the results of
the lexicon search.  That imposes limitations on including XML documents in
the same database that have no namespace and which employ the same element
name as an element range index.  I get around this by using different
databases, which is not too bad because there is a lot of data.  I can
assign different forests and segregate the physical storage if and as
necessary, but inter-database queries become a bottleneck when it comes to
performance.

I did find a relatively simple way to add a filtered search feature to the
lexicon search results by using cts:contains with the desired filtered
subquery (i.e. cts:element-value-query with "punctuation-insensitive" and
"whitespace-insensitive").

Would you clarify what you mean by using positions or phrase terms?  How
would I configure these to ensure that a search (wildcarded or unwildcarded)
will only provide results in the order of the request terms?

Thanks again!

Tim Meagher

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Mary Holstege
Sent: Thursday, June 25, 2009 3:16 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Whitespace, Punctuation,
Collations,and&amp;

OK, let me try to explain a little less tersely.

When you do a search in MarkLogic, there are two phases involved:
selecting fragments using available indices, and then filtering those
fragments to return only those that actually match.  Depending on
the query and the available indices, the index resolution part may
return false positives, which the filter will then reject to ensure that
the results of the query are correct.  For example, if you look for the
phrase "a simple example" and don't have positions or phrase terms,
the index resolution will return any fragment that has "a" and "simple"
and "example", regardless of whether they are right next to each other
or not, because the word index doesn't know any of that.
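
To make the two phases concrete, here is a minimal sketch that runs the same
phrase query once with index resolution only ("unfiltered") and once with the
filter applied. It assumes the query is run against a database whose
documents contain the words in question; nothing else is configured.

```xquery
xquery version "1.0-ml";

(: Sketch: compare index resolution alone with the filtered result. :)
let $q := cts:word-query("a simple example")
return (
  (: index resolution only; may include false positives :)
  fn:count(cts:search(fn:doc(), $q, "unfiltered")),
  (: index resolution followed by filtering; the accurate result :)
  fn:count(cts:search(fn:doc(), $q, "filtered"))
)
```

If the first number is larger than the second, the indexes alone could not
resolve the phrase and the filter rejected the extra fragments.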

So, the issue here is that when it comes to value queries only those with
the option "exact" (or the equivalent list of case-sensitive,
diacritic-sensitive, punctuation-sensitive, and whitespace-sensitive) are
capable of distinguishing between a value of "Personality & Individ. Diff"
and "Personality Individ. Diff" directly from the indexes.  If you were to
do an estimate on the whitespace-insensitive element-value search you would
see that it is larger than a count on the same search.  In this case it has
nothing to do with your fragment roots, it has to do with the way we index
element values and punctuation.  There is nothing wrong with your
configuration in this case.
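
A sketch of what an "exact" constraint might look like when used as the query
argument of a lexicon call, reusing the JournalTitle element and the value
from the original question. It assumes a string element range index on
JournalTitle; the result element names are arbitrary.

```xquery
xquery version "1.0-ml";

(: Sketch: "exact" is shorthand for case-sensitive, diacritic-sensitive,
   punctuation-sensitive, whitespace-sensitive, unstemmed, unwildcarded. :)
element results {
  for $value in
    cts:element-values(
      xs:QName("JournalTitle"), (),
      ("item-frequency", "item-order", "ascending"),
      cts:element-value-query(
        xs:QName("JournalTitle"),
        "Personality &amp; Individ. Diff.",
        "exact"))
  return
    element result {
      attribute frequency { cts:frequency($value) },
      $value
    }
}
```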

Since the query argument to cts:element-values is not filtered, if you
use a query that is not accurately resolvable directly from the indexes,
you will see the behaviour you describe, where the query will return
fragments that the filter would have rejected.

So, your options are to make sure you only give element-values a query
that is accurately resolvable from the indexes you have enabled (for a
value query this means "exact" or not having either whitespace or
punctuation) or to accept that you are going to get some false matches.
The way you can check this is by executing the query all by itself and
seeing whether fn:count and xdmp:estimate give the same answer.
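
The check described above could be sketched as follows, using the
whitespace-insensitive variant of the query from the original question. The
option list is taken from that message; only the comparison is new.

```xquery
xquery version "1.0-ml";

(: Sketch: run the query by itself and compare the unfiltered estimate
   with the filtered count. :)
let $q :=
  cts:element-value-query(
    xs:QName("JournalTitle"),
    "Personality &amp; Individ. Diff.",
    ("case-sensitive", "diacritic-sensitive",
     "punctuation-sensitive", "whitespace-insensitive",
     "unstemmed", "unwildcarded"))
let $estimate := xdmp:estimate(cts:search(fn:doc(), $q))  (: indexes only :)
let $count    := fn:count(cts:search(fn:doc(), $q))       (: after filtering :)
return
  element check {
    attribute estimate { $estimate },
    attribute count { $count },
    (: true means the query is accurately resolvable from the indexes :)
    $estimate eq $count
  }
```

When the two numbers differ, the same query used as the argument of a lexicon
function can be expected to pull in values from fragments the filter would
have rejected.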

You might also try using cts:element-value-match instead.  It looks like
your query is trying to apply a match-kind of constraint on the lexicon,
and cts:element-value-match lets you get at that more directly.

//Mary

On Thu, 25 Jun 2009 11:44:28 -0700, Tim Meagher <[email protected]> wrote:

> Hi Mary,
>
> I'm not sure that I entirely follow you on this.  First of all, the use of
> whitespace-sensitive versus whitespace-insensitive is making a difference,
> which leads me to believe that it is affecting the results of an unfiltered
> search.  Secondly, I understand that the lexicon searches are performed
> against fragments and not documents, so I set my fragment root to the
> element level at which only one instance of the JournalTitle element is
> defined.  This leads me to believe that the frequency counts obtained
> against any JournalTitle in the lexicon are exact counts.
>
> I would like to get a better idea of what you meant by a non-exact value
> query as related to an unfiltered search.  I figure there has to be a
> certain degree of determinism involved in such a case, that the same request
> will continue to yield the same results, and that some (if not all) degrees
> of non-exact results can be avoided by doing things such as configuring the
> fragment root appropriately, and using cts:element-value-query as a subquery
> as if it were acting as close as possible to a filtered search.
>
> My goal in all of this is to create reports that have counts obtained from
> the lexicon frequency counts (using cts:frequency) and from co-occurrences
> frequency counts that provide fairly accurate statistics for indexed
> elements and their relationship with other indexed elements and attributes.
> It makes for nice real-time reports, but if the numbers aren't exact, I'd
> like to be able to understand why, to take corrective action in code (if
> possible), and to be able to inform users when and why certain results are
> not reliable.
>
> Thanks for your response - I hope I haven't come across harshly - I'm just
> trying to make optimal use of MarkLogic and want to understand how to
> improve my MarkLogic database configuration and coding practices.
>
> Tim
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Mary Holstege
> Sent: Thursday, June 25, 2009 12:55 PM
> To: General Mark Logic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Whitespace, Punctuation, Collations,
> and&amp;
>
> The issue here isn't the lexicon and its collation, the issue here is
> the element-value-query.
>
> The query used with a lexicon operation in this way is executed
> unfiltered, and a non-exact (i.e. whitespace-insensitive)
> punctuation-sensitive value query does not resolve the difference
> between "Personality & Individ. Diff" and
> "Personality Individ. Diff" in the index.
>
> What you need to do is make sure that you only use queries that
> are accurate against the index (i.e. where fn:count=xdmp:estimate).
>
> //Mary
>
> On Thu, 25 Jun 2009 09:08:09 -0700, Tim Meagher <[email protected]> wrote:
>
>> Hi Folks,
>>
>> I have come across an interesting phenomenon and am trying to understand
>> it.  I have a (range element index) lexicon configured for JournalTitle
>> using the root collation which contains the following distinct values:
>>
>> Personality & Individ. Diff.
>>
>> and
>>
>> Personality Individ. Diff.
>>
>> If I perform a lexicon search for "Personality &amp; Individ. Diff." using
>> cts:element-values() with a subquery of cts:element-value-query()
>> specifying the equivalent of an exact match in the search options as
>> follows:
>>
>> element results {
>>   for $result in
>>     cts:element-values(xs:QName("JournalTitle"), (),
>>        ("item-frequency", "item-order", "ascending"),
>>        cts:and-query((
>>          cts:element-value-query(xs:QName("JournalTitle"),
>>            "Personality &amp; Individ. Diff.",
>>            ("case-sensitive", "diacritic-sensitive",
>>             "punctuation-sensitive", "whitespace-sensitive",
>>             "unstemmed", "unwildcarded"))
>>        ))
>>      )[1 to 20]
>>   return element result {$result}
>> }
>>
>> then I get the following results (as expected):
>>
>> <results>
>>   <result>Personality & Individ. Diff.</result>
>> </results>
>>
>> However, if I change the request options to whitespace-insensitive, then
>> I get the following results:
>>
>> <results>
>>   <result>Personality & Individ. Diff.</result>
>>   <result>Personality Individ. Diff.</result>
>> </results>
>>
>> This implies to me that the ampersand is treated like whitespace.  I
>> would have expected it to be treated as punctuation, but I'm not sure
>> exactly which characters (including escaped characters) count as
>> whitespace and which as punctuation.  I've looked into the UCA and
>> ISO-8859-1 specs to try to understand the default MarkLogic root
>> collation, but I haven't found a simple list that would help me to
>> understand why I'm getting the above results.  Can anyone shed some light
>> on this?
>>
>> Can someone also help clarify the distinction between the default
>> MarkLogic root collation (http://marklogic.com/collation) and the
>> codepoint collation (http://marklogic.com/collation/codepoint)?  I'm
>> trying to find the ideal collation for my JournalTitle lexicon.
>>
>> Thanks for the help!
>>
>> Tim Meagher - AAOM Consulting
>>
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general

End of General Digest, Vol 60, Issue 39
***************************************