Hi Geert,

Thanks for your response; your input is certainly valuable. I'll let you know about the results.

> Thirdly, you select top ten on the outside of the for-loop. If it is
> possible to get that in the for expression of your for-loop, that
> should speed things up much as well

Is there a common pattern to do that? I need the top 10 items after the order by.

Kind regards,

Laurens van den Oever
Xopus BV

http://xopus.com
+31 70 4452345
KvK 27301795

Date: Mon, 27 Jul 2009 16:26:00 +0200
From: Geert Josten <[email protected]>
Subject: RE: [MarkLogic Dev General] Sorting by the number of occurences of a paragraph
To: General Mark Logic Developer Discussion <[email protected]>
Message-ID: <0260356c6dfe754ba6fa48e659a14338269cae7...@helios.olympus.borgus.nl>
Content-Type: text/plain; charset="Windows-1252"

Hi Laurens,

Have you looked into cts:element-values and the related functions? These are based purely on the MarkLogic Server indexes and are far quicker than calls to distinct-values.

I'm not sure whether it makes a difference, but you could also use cts:remainder instead of xdmp:estimate with a search as argument.

Thirdly, you select the top ten on the outside of the for-loop. If it is possible to get that into the for expression of your for-loop, that should speed things up a lot as well.

Your statements about timings seem to indicate that your performance relies on caching within MarkLogic Server, but using only index-based functions makes caching unnecessary.

HTH,
Geert

Drs. G.P.H. Josten
Consultant

http://www.daidalos.nl/

Daidalos BV
Source of Innovation
Hoekeindsehof 1-4
2665 JZ Bleiswijk
Tel.: +31 (0) 10 850 1200
Fax: +31 (0) 10 850 1199
http://www.daidalos.nl/
KvK 27164984

The information sent in or with this email message originates from Daidalos BV and is intended solely for the addressee. If you have received this message unintentionally, we ask that you delete it. No rights can be derived from this message.

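
A minimal sketch of the lexicon-based approach suggested above, for illustration only. It assumes a string-typed element range index has been configured on paragraph (nothing in the thread says one exists), and it reuses the "search phrase" placeholder from the original query. The "frequency-order" and "limit=10" options move both the ordering and the top-ten cut into the index call itself, so no outer [1 to 10] is needed.

```xquery
xquery version "1.0-ml";

(: Assumption: a string element range index exists on <paragraph>.
   Without it, cts:element-values cannot resolve the lexicon. :)
let $query :=
  cts:element-word-query(xs:QName("paragraph"), "search phrase")
for $value in cts:element-values(
    xs:QName("paragraph"),
    (),                                             (: no start value :)
    ("frequency-order", "descending", "limit=10"),  (: most frequent first, top ten only :)
    $query)                                         (: restrict to matching fragments :)
return
  (: cts:frequency reports how often the lexicon value occurs :)
  <result count="{cts:frequency($value)}">{$value}</result>
```

Two caveats with this sketch. The values come back as strings, so it does not answer the mixed-content question. Also, the query argument selects fragments rather than paragraphs, so with fragmentation at the translation level the result may include other paragraphs that merely share a fragment with a matching one; an extra filter on the returned values would probably be needed.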
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Laurens van den Oever
> Sent: Monday, 27 July 2009 16:11
> To: [email protected]
> Subject: [MarkLogic Dev General] Sorting by the number of
> occurences of a paragraph
>
> Hi all,
>
> I'm pretty new to MarkLogic, so chances are that I've made
> some trivial mistake here.
>
> I have roughly the following structure:
>
> <manual>
>   <translation lang="..."><!-- no xml:lang due to legacy -->
>     <!-- arbitrary nesting of other elements -->
>       <paragraph>
>
> I have about 5000 manuals with on average 16 translations
> each, bringing the total of distinct (!) paragraphs to 700000.
> The goal is to stimulate content reuse from the authoring interface.
> I want to show the authors about 10 paragraphs which contain
> a search phrase and, here it comes, ordered by the number of
> occurrences of that paragraph in the collection.
> I assume that a distinct paragraph only occurs once in a translation.
>
> I realize that I'm trying to achieve something close to
> impossible, expecting fast results from a query that compares
> a large part of the db against the whole db, but I'm amazed
> that I've come this far and I'd like to see if I can get this
> to the next level.
>
> I started with the following query:
>
> (for $para in cts:search(//paragraph,
>     cts:element-word-query(xs:QName("paragraph"), "search phrase"))
>   let $count := xdmp:estimate(cts:search(//paragraph,
>     cts:element-word-query(xs:QName("paragraph"), $para)))
>   order by number($count) descending
>   return
>     <result count="{$count}">
>       {$para}
>     </result>
> )[1 to 10]
>
> There are two problems with this approach:
> 1. it is far too slow
> 2. it returns multiple occurrences of the same content
>
> I've been able to improve performance with the following measures:
> - Maximizing the number of initial search results.
> - Refragmenting the database on <translation/> level.
> - Made <paragraph/> the root of a field.
> - Reduced the scope of the query to one language using a
>   [...@lang="EN"] predicate but that slowed things down.
> - Simple scoring improved performance and accuracy as
>   relevance seems to contradict my quest for the most occurrences.
>
> To eliminate the multiple occurrences I've used
> fn:distinct-values, but the downside is that it returns a
> string and I need the paragraph element including all markup.
> Now my new query is:
>
> (for $p in fn:distinct-values(
>     cts:search(
>       /manual/translation//paragraph,
>       cts:field-word-query("paragraph", "search query"),
>       ("score-simple"))[1 to 250])
>   let $count := xdmp:estimate(
>     cts:search(
>       /manual/translation//paragraph,
>       cts:field-word-query("paragraph", $p),
>       ("score-simple")))
>   order by number($count) descending
>   return <result count="{$count}">{$p}</result>
> )[1 to 10]
>
> This is often very fast, but can take far too long if I
> happen to hit a batch of documents/fragments that weren't hit
> recently.
>
> Is there more I can do here?
> Or is there a completely different approach that may yield
> better results?
> And how do I get mixed content results?
>
> Thanks for reading through all this!
>
> Kind regards,
>
> Laurens van den Oever
> Xopus BV
>
> http://xopus.com
> +31 70 4452345
> KvK 27301795
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
