Hi Laurens,
If I follow your design correctly, what I would do is the following:
1) iterate over all your paragraphs and use xdmp:md5() to generate a hash value
2) add this hash value as an attribute to each paragraph, e.g.
<paragraph hash-id="abc123">hello world</paragraph>
3) create a string range index in the codepoint collation on the
paragraph/@hash-id attribute
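Steps 1 and 2 might look roughly like the sketch below. It is an illustration under assumptions rather than a tested script: the document URI is a made-up example, and hashing the serialized paragraph with xdmp:quote() (so that markup differences count) is one choice among several; hashing fn:string($p) instead would ignore markup. With a large number of paragraphs you would want to run this in batches, one document or a handful per transaction.

```xquery
xquery version "1.0-ml";

(: Sketch only: stamps every paragraph in one document with a hash of its
   serialized content. $uri is a hypothetical example URI. :)
declare variable $uri as xs:string := "/manuals/example.xml";

(: skip paragraphs that already carry a hash-id so the script can be re-run :)
for $p in fn:doc($uri)//paragraph[fn:not(@hash-id)]
return
  xdmp:node-insert-child(
    $p,
    (: passing an attribute node adds it to the paragraph element :)
    attribute hash-id { xdmp:md5(xdmp:quote($p)) }
  )
```

The range index from step 3 still has to be configured on the database before the lexicon call will work.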
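Then, to return the hash values in frequency order, you can call
cts:element-attribute-values(xs:QName("paragraph"), xs:QName("hash-id"), (), ("frequency-order", "item-frequency")).
The "frequency-order" option sorts by frequency, and "item-frequency" counts
every occurrence rather than every fragment.
You can filter this list with any search expression by passing a cts:query as
an additional argument (see below).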
This approach allows you to quickly get the hash-id in frequency order, with or
without a cts:query. You'll then need to go get a paragraph that matches the
hash-id. Because there may be many, you can simply grab the first.
(: the query argument must be a cts:query, not a plain string :)
let $q := cts:word-query("search phrase")
for $id in
  cts:element-attribute-values(xs:QName("paragraph"), xs:QName("hash-id"), (),
    ("frequency-order", "item-frequency"), $q)
return
  element result {
    attribute count { cts:frequency($id) },
    (//paragraph[@hash-id eq $id])[1]
  }
Finally, before doing any of this, I would get rid of your fragmentation. You
probably don't need fields, but we can continue to talk about how they might be
useful for this task. I also don't think you need to limit to a specific
language, but that shouldn't slow things down if you want to use it (be sure to
look over our developer guide on using languages, and your server license *may*
come into play on this subject).
This should be very fast - well under a second as long as there aren't too many
paragraphs being returned. Getting the hash-ids will be resolved out of the
indexes, whereas each paragraph returned will incur a disk i/o. 100 or so
results should be sub-second.
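
Since only about 10 paragraphs are needed, one way to keep the disk i/o bounded is to cut the list of hash-ids down before fetching any paragraph. A rough sketch, assuming the range index from step 3 exists and that "frequency-order" combined with "item-frequency" gives most-frequent-first ordering with per-occurrence counts:

```xquery
xquery version "1.0-ml";

let $q := cts:word-query("search phrase")
(: index lookup only; the predicate keeps the first 10 hash-ids in
   frequency order :)
let $ids :=
  cts:element-attribute-values(
    xs:QName("paragraph"), xs:QName("hash-id"), (),
    ("frequency-order", "item-frequency"), $q)[1 to 10]
for $id in $ids
return
  element result {
    attribute count { cts:frequency($id) },
    (: one paragraph fetch per hash-id, so at most 10 fetches :)
    (//paragraph[@hash-id eq $id])[1]
  }
```

Each result element carries the full paragraph element, markup included, so mixed content comes back intact.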
Kelly
Message: 4
Date: Mon, 27 Jul 2009 16:11:16 +0200
From: Laurens van den Oever <[email protected]>
Subject: [MarkLogic Dev General] Sorting by the number of occurences
of a paragraph
To: [email protected]
Hi all,
I'm pretty new to MarkLogic, so chances are that I've made some trivial
mistake here.
I have roughly the following structure:
<manual>
  <translation lang="..."><!-- no xml:lang due to legacy -->
    <!-- arbitrary nesting of other elements -->
      <paragraph>
I have about 5000 manuals with on average 16 translations each, bringing the
total of distinct (!) paragraphs to 700000.
The goal is to stimulate content reuse from the authoring interface.
I want to show the authors about 10 paragraphs which contain a search phrase
and, here it comes, ordered by the number of occurrences of that paragraph in
the collection.
I assume that a distinct paragraph only occurs once in a translation.
I realize that I'm trying to achieve something close to impossible;
expecting fast results from a query that compares a large part of the db
against the whole db, but I'm amazed that I've come this far and I'd like to
see if I can get this to the next level.
I started with the following query:
(for $para in cts:search(//paragraph,
    cts:element-word-query(xs:QName("paragraph"), "search phrase"))
 let $count := xdmp:estimate(cts:search(//paragraph,
    cts:element-word-query(xs:QName("paragraph"), $para)))
 order by number($count) descending
 return
   <result count="{$count}">
     {$para}
   </result>
)[1 to 10]
There are two problems with this approach:
1. it is far too slow
2. it returns multiple occurrences of the same content
I've been able to improve performance with the following measures:
- Maximizing the number of initial search results.
- Refragmenting the database on <translation/> level.
- Making <paragraph/> the root of a field.
- Reducing the scope of the query to one language using a [...@lang="EN"]
predicate, but that slowed things down.
- Using simple scoring, which improved both performance and accuracy, since
relevance seems to work against my quest for the most occurrences.
To eliminate the multiple occurrences I've used fn:distinct-values, but the
downside is that it returns a string and I need the paragraph element
including all markup.
Now my new query is:
(for $p in fn:distinct-values(
     cts:search(
       /manual/translation//paragraph,
       cts:field-word-query("paragraph", "search query"),
       ("score-simple"))[1 to 250])
 let $count := xdmp:estimate(
     cts:search(
       /manual/translation//paragraph,
       cts:field-word-query("paragraph", $p),
       ("score-simple")))
 order by number($count) descending
 return <result count="{$count}">{$p}</result>
)[1 to 10]
This is often very fast, but can take far too long if I happen to hit a
batch of documents/fragments that weren't hit recently.
Is there more I can do here?
Or is there a completely different approach that may yield better results?
And how do I get mixed content results?
Thanks for reading through all this!
Kind regards,
Laurens van den Oever
Xopus BV
http://xopus.com
+31 70 4452345
KvK 27301795