RE: [MarkLogic Dev General] Sorting by the number of occurences of a paragraph

Laurens van den Oever Mon, 27 Jul 2009 09:24:49 -0700

Hi Geert,

Thanks for your response; your input is certainly valuable. I'll let you
know about the results.


> Thirdly, you select the top ten outside the for-loop. If it is
> possible to move that selection into the for expression of your
> for-loop, that should speed things up considerably as well.

Is there a common pattern to do that? I need the top 10 items after the
order by.

Kind regards,

Laurens van den Oever
Xopus BV

http://xopus.com
+31 70 4452345
KvK 27301795

Date: Mon, 27 Jul 2009 16:26:00 +0200
From: Geert Josten <[email protected]>
Subject: RE: [MarkLogic Dev General] Sorting by the number of
       occurences of a paragraph
To: General Mark Logic Developer Discussion
       <[email protected]>
Message-ID:
       <0260356c6dfe754ba6fa48e659a14338269cae7...@helios.olympus.borgus.nl>
Content-Type: text/plain; charset="Windows-1252"

Hi Laurens,

Have you looked into cts:element-values and related functions? These are
based purely on the MarkLogic Server indexes and are far quicker than
calls to distinct-values.

I'm not sure whether it makes a difference, but you could also use
cts:remainder instead of xdmp:estimate with a search as argument.
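
A minimal sketch of the cts:remainder variant, assuming the same
//paragraph search as in the original query; the variable names are
chosen for illustration only. cts:remainder takes a node returned by
cts:search and reports the estimated number of results from that node
onward, so calling it on the first hit gives an estimate of the total:

```xquery
xquery version "1.0-ml";

(: Illustrative only: $query and $first are names chosen for this sketch. :)
let $query := cts:element-word-query(xs:QName("paragraph"), "search phrase")
(: Fetch just the first hit; cts:remainder needs a node from a cts:search result. :)
let $first := cts:search(//paragraph, $query)[1]
return
  (: With no hits there is no node to ask, so fall back to 0. :)
  if (fn:exists($first)) then cts:remainder($first) else 0
```

Like xdmp:estimate, the number is an index-based estimate, so it may
differ from a count of the filtered results.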

Thirdly, you select the top ten outside the for-loop. If it is
possible to move that selection into the for expression of your for-loop,
that should speed things up considerably as well.
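
One possible way to move the limit into the for expression, sketched
under the assumption that an element range index of type xs:string exists
on paragraph (nothing in this thread confirms that, and it may not suit
paragraphs with mixed content): cts:element-values can return values
already ordered by frequency and capped at ten, so no outer [1 to 10] is
needed.

```xquery
xquery version "1.0-ml";

(: Assumes an element range index of type xs:string on paragraph. :)
for $value in cts:element-values(
    xs:QName("paragraph"),
    (),                                             (: no start value :)
    ("frequency-order", "descending", "limit=10"),  (: most frequent first, at most 10 :)
    cts:element-word-query(xs:QName("paragraph"), "search phrase"))
return
  (: cts:frequency reports how many matching fragments contain the value. :)
  <result count="{cts:frequency($value)}">{$value}</result>
```

Two caveats apply to this sketch. The values come back as strings, not
as paragraph elements, so the markup is lost. And the query argument
selects fragments, so with fragmentation at translation level the values
of other paragraphs in a matching fragment can show up as well.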

Your statements about timings seem to indicate that your performance
relies on caching within MarkLogic Server, but using only index-based
functions makes caching unnecessary.

HTH,
Geert

Drs. G.P.H. Josten
Consultant


http://www.daidalos.nl/
Daidalos BV
Source of Innovation
Hoekeindsehof 1-4
2665 JZ Bleiswijk
Tel.: +31 (0) 10 850 1200
Fax: +31 (0) 10 850 1199
http://www.daidalos.nl/
KvK 27164984
The information sent in or with this email message originates from
Daidalos BV and is intended solely for the addressee. If you have
received this message unintentionally, we ask you to delete it. No rights
can be derived from this message.


> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Laurens van den Oever
> Sent: Monday 27 July 2009 16:11
> To: [email protected]
> Subject: [MarkLogic Dev General] Sorting by the number of
> occurences of a paragraph
>
> Hi all,
>
> I'm pretty new to MarkLogic, so chances are that I've made
> some trivial mistake here.
>
>
> I have roughly the following structure:
>
> <manual>
>   <translation lang="..."><!-- no xml:lang due to legacy -->
>
>     <!-- arbitrary nesting of other elements -->
>       <paragraph>
>
> I have about 5000 manuals with on average 16 translations
> each, bringing the total of distinct (!) paragraphs to 700000.
> The goal is to stimulate content reuse from the authoring interface.
> I want to show the authors about 10 paragraphs which contain
> a search phrase and, here it comes, are ordered by the number of
> occurrences of that paragraph in the collection.
> I assume that a distinct paragraph only occurs once in a translation.
>
> I realize that I'm trying to achieve something close to
> impossible; expecting fast results from a query that compares
> a large part of the db against the whole db, but I'm amazed
> that I've come this far and I'd like to see if I can get this
> to the next level.
>
> I started with the following query:
>
>  (for $para in cts:search(//paragraph,
> cts:element-word-query(xs:QName("paragraph"), "search phrase"))
>   let $count := xdmp:estimate(cts:search(//paragraph,
> cts:element-word-query(xs:QName("paragraph"), $para)))
>   order by number($count) descending
>   return
>   <result count="{$count}">
>     {$para}
>   </result>
>   )[1 to 10]
>
> There are two problems with this approach:
> 1. it is far too slow
> 2. it returns multiple occurrences of the same content
>
> I've been able to improve performance with the following measures:
> - Maximizing the number of initial search results.
> - Refragmenting the database on <translation/> level.
> - Made <paragraph/> the root of a field.
> - Reduced the scope of the query to one language using a
> [...@lang="EN"] predicate but that slowed things down.
> - Simple scoring improved performance and accuracy as
> relevance seems to contradict my quest for the most occurrences.
>
> To eliminate the multiple occurrences I've used
> fn:distinct-values, but the downside is that it returns a
> string and I need the paragraph element including all markup.
> Now my new query is:
>
>  (for $p in fn:distinct-values(
>     cts:search(
>       /manual/translation//paragraph,
>       cts:field-word-query("paragraph", "search query"),
>       ("score-simple"))[1 to 250])
>   let $count := xdmp:estimate(
>     cts:search(
>       /manual/translation//paragraph,
>       cts:field-word-query("paragraph", $p),
>       ("score-simple")))
>   order by number($count) descending
>   return <result count="{$count}">{$p}</result>
> )[1 to 10]
>
> This is often very fast, but can take far too long if I
> happen to hit a batch of documents/fragments that weren't hit
> recently.
>
> Is there more I can do here?
> Or is there a completely different approach that may yield
> better results?
> And how do I get mixed content results?
>
> Thanks for reading through all this!
>
> Kind regards,
>
> Laurens van den Oever
> Xopus BV
>
> http://xopus.com
> +31 70 4452345
> KvK 27301795
>
>

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
