[MarkLogic Dev General] RE: Sorting by the number of occurences of a paragraph

Laurens van den Oever Wed, 29 Jul 2009 05:30:28 -0700

Hi Kelly,

Thank you for your excellent response. Your solution seems to do exactly
what I need.


I have removed my fragmentation and field, set the hash-id attribute on all
4M paragraphs and added the attribute range index.
Unfortunately I then got an exception that no element-attribute range index
exists for the given element/attribute QNames.
I couldn't find anything wrong with my settings or with the local names/namespaces.
I assume that the problem was caused by messing with the reindexing settings
while refragmenting/reindexing. Is that possible?

I've now removed the index and am waiting for the reindexing to complete.
After that I will add the index again.
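
For reference, the index could also be added from a script rather than through the Admin UI. This is only a sketch: it assumes a server version that ships the Admin API module, a database called "Documents" (a hypothetical name), and that paragraph and hash-id are in no namespace.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder; use the name of your own database. :)
let $config := admin:get-configuration()
let $db     := xdmp:database("Documents")
(: String range index on paragraph/@hash-id in the codepoint collation.
   Both namespace arguments are empty on the assumption that neither
   the element nor the attribute is namespaced. :)
let $index  := admin:database-range-element-attribute-index(
                 "string",
                 "", "paragraph",
                 "", "hash-id",
                 "http://marklogic.com/collation/codepoint",
                 fn:false())
return
  admin:save-configuration(
    admin:database-add-range-element-attribute-index($config, $db, $index))
```

Scripting the definition makes it easier to see exactly which namespace and local name the index was created with, which is where a mismatch with the QNames used in the query would show up.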

>  I also don't think you need to limit to a specific language, but that
shouldn't slow things down if you want to use it

The query-trace showed that the extra predicate needed to be filtered while
the rest of the xpath could be resolved from the indexes.
I had the feeling that removing it resulted in better performance, but I've
not done any thorough testing and I had made other changes as well.
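
A minimal sketch of how such a trace can be produced, for anyone who wants to repeat the comparison. The path and the lang predicate below are assumptions based on the structure quoted further down, not the exact expression I ran.

```xquery
xquery version "1.0-ml";

(: Turn on query tracing for this request; the trace is written to the
   server error log, not returned with the results. :)
xdmp:query-trace(fn:true()),

(: Hypothetical query: run once with and once without the
   [@lang eq "EN"] predicate and compare the two traces. :)
cts:search(
  /manual/translation[@lang eq "EN"]//paragraph,
  cts:word-query("search phrase"))[1 to 10]
```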

I will let you know when I have the final results.

Kind regards,

Laurens van den Oever
Xopus BV

http://xopus.com
+31 70 4452345
KvK 27301795

Date: Mon, 27 Jul 2009 10:34:34 -0700
From: Kelly Stirman <[email protected]>
Subject: [MarkLogic Dev General] RE: Sorting by the number of occurences of a paragraph
To: "[email protected]"
       <[email protected]>

Hi Laurent,

If I follow your design correctly, what I would do is the following:

1) iterate over all your paragraphs and use xdmp:md5() to generate a hash
value
2) add this hash value as an attribute to each paragraph, e.g. <paragraph
hash-id="abc123">hello world</paragraph>
3) create a string range index in the codepoint collation on the
paragraph/@hash-id attribute
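
Steps 1 and 2 could be sketched roughly as follows. This is an illustration under assumptions, not a tested migration script: it hashes the whitespace-normalized string value of each paragraph, so two paragraphs that differ only in inline markup get the same hash-id. If markup should count, hashing xdmp:quote($p) instead would be one option.

```xquery
xquery version "1.0-ml";

(: Add a hash-id attribute to every paragraph, or refresh it if present. :)
for $p in //paragraph
let $hash := xdmp:md5(fn:normalize-space(fn:string($p)))
return
  if (fn:exists($p/@hash-id))
  then xdmp:node-replace($p/@hash-id, attribute hash-id { $hash })
  else xdmp:node-insert-child($p, attribute hash-id { $hash })
```

With millions of paragraphs a single update transaction like this will most likely be too large, so in practice the loop would need to run in batches, for example one document or one directory at a time.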

Then to return paragraphs in frequency order, you can call
cts:element-attribute-values(xs:QName("paragraph"),xs:QName("hash-id"),(),("frequency-order","item-frequency")).
You can filter this list with any search expression by passing a cts:query
as an additional argument (see below).

This approach allows you to quickly get the hash-id in frequency order, with
or without a cts:query. You'll then need to go get a paragraph that matches
the hash-id. Because there may be many, you can simply grab the first.


let $q := cts:word-query("search phrase")
for $id in
cts:element-attribute-values(xs:QName("paragraph"),xs:QName("hash-id"),(),
  ("frequency-order","item-frequency"),$q)
return element result {attribute count {cts:frequency($id)},
  (//paragraph[@hash-id eq $id])[1]}

Finally, before doing any of this, I would get rid of your fragmentation.
You probably don't need fields, but we can continue to talk about how they
might be useful for this task. I also don't think you need to limit to a
specific language, but that shouldn't slow things down if you want to use it
(be sure to look over our developer guide on using languages, and your
server license *may* come into play on this subject).

This should be very fast - well under a second as long as there aren't too
many paragraphs being returned. Getting the hash-ids will be resolved out of
the indexes, whereas each paragraph returned will incur a disk i/o. 100 or
so results should be sub-second.
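
To keep the number of paragraph fetches small, the list of hash-ids can be cut down before any paragraph is read from disk. The following is a sketch under assumptions: it takes the first 10 values, and it looks each paragraph up with an attribute value query rather than an XPath predicate. The choice of 10 comes from the original question; everything else is illustrative.

```xquery
xquery version "1.0-ml";

let $q := cts:word-query("search phrase")
(: Only the first 10 hash-ids, most frequent first; this part is
   resolved from the range index. :)
let $ids := cts:element-attribute-values(
              xs:QName("paragraph"), xs:QName("hash-id"), (),
              ("frequency-order", "item-frequency"), $q)[1 to 10]
for $id in $ids
return
  element result {
    attribute count { cts:frequency($id) },
    (: One representative paragraph per hash-id. :)
    cts:search(//paragraph,
      cts:element-attribute-value-query(
        xs:QName("paragraph"), xs:QName("hash-id"), $id))[1]
  }
```

One caveat, as far as I understand lexicon calls: the cts:query selects matching fragments, not individual paragraphs, so a hash-id can be returned for any paragraph in a fragment that contains the phrase. If that matters, the results may need an extra check that the returned paragraph itself contains the phrase.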

Kelly


Message: 4
Date: Mon, 27 Jul 2009 16:11:16 +0200
From: Laurens van den Oever <[email protected]>
Subject: [MarkLogic Dev General] Sorting by the number of occurences of a paragraph
To: [email protected]

Hi all,
I'm pretty new to MarkLogic, so chances are that I've made some trivial
mistake here.

I have roughly the following structure:

<manual>
 <translation lang="..."><!-- no xml:lang due to legacy -->
   <!-- arbitrary nesting of other elements -->
     <paragraph>

I have about 5000 manuals with on average 16 translations each, bringing the
total of distinct (!) paragraphs to 700000.
The goal is to stimulate content reuse from the authoring interface.
I want to show the authors about 10 paragraphs that contain a search phrase,
ordered (and this is the hard part) by the number of occurrences of that
paragraph in the collection.
I assume that a distinct paragraph only occurs once in a translation.

I realize that I'm trying to achieve something close to impossible;
expecting fast results from a query that compares a large part of the db
against the whole db, but I'm amazed that I've come this far and I'd like to
see if I can get this to the next level.

I started with the following query:

 (for $para in cts:search(//paragraph,
      cts:element-word-query(xs:QName("paragraph"), "search phrase"))
  let $count := xdmp:estimate(cts:search(//paragraph,
      cts:element-word-query(xs:QName("paragraph"), $para)))
  order by number($count) descending
  return
  <result count="{$count}">
    {$para}
  </result>
 )[1 to 10]

There are two problems with this approach:
1. it is far too slow
2. it returns multiple occurrences of the same content

I've been able to improve performance with the following measures:
- Maximizing the number of initial search results.
- Refragmenting the database on <translation/> level.
- Made <paragraph/> the root of a field.
- Reduced the scope of the query to one language using a [...@lang="EN"]
predicate but that slowed things down.
- Simple scoring improved performance and accuracy, as relevance seems to
contradict my quest for the most occurrences.

To eliminate the multiple occurrences I've used fn:distinct-values, but the
downside is that it returns a string and I need the paragraph element
including all markup.
Now my new query is:

 (for $p in fn:distinct-values(
   cts:search(
     /manual/translation//paragraph,
     cts:field-word-query("paragraph", "search query"),
     ("score-simple"))[1 to 250])
 let $count := xdmp:estimate(
   cts:search(
     /manual/translation//paragraph,
     cts:field-word-query("paragraph", $p),
     ("score-simple")))
 order by number($count) descending
 return <result count="{$count}">{$p}</result>
)[1 to 10]

This is often very fast, but can take far too long if I happen to hit a
batch of documents/fragments that weren't hit recently.

Is there more I can do here?
Or is there a completely different approach that may yield better results?
And how do I get mixed content results?

Thanks for reading through all this!

Kind regards,

Laurens van den Oever
Xopus BV

http://xopus.com
+31 70 4452345
KvK 27301795

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
