Laurens, I didn't see your error, but I'm guessing you did not pass the collation as an option in your call to cts:element-values(). You need to match namespace, localname, and collation.
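For concreteness, the collation is passed through the options argument of the values call. A minimal sketch against the index discussed later in this thread (a string index on paragraph/@hash-id in the codepoint collation); the element/attribute QNames and collation here must match the configured index exactly:

  (: sketch only: the QNames and the collation option must match the
     configured element-attribute range index exactly :)
  cts:element-attribute-values(
    xs:QName("paragraph"),
    xs:QName("hash-id"),
    (),
    "collation=http://marklogic.com/collation/codepoint"
  )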
I should have said to create the range index first, then add the hash-id attribute. That would result in a single reindexing of your content (adding a new index will not cause documents to be reindexed unless there is actually something that needs to be reindexed). However, it seemed easier to explain in the order I explained it. :-)

As for your reading of the query trace: in general, the time to evaluate a query in MarkLogic is proportional to the number of results, not to the complexity of the query. Having more constraints often results in faster queries.

Kelly

------------------------------

Message: 1
Date: Wed, 29 Jul 2009 14:29:01 +0200
From: Laurens van den Oever <[email protected]>
Subject: [MarkLogic Dev General] RE: Sorting by the number of occurrences of a paragraph
To: general <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

Hi Kelly,

Thank you for your excellent response. Your solution seems to do exactly what I need.

I have removed my fragmentation and field, set the hash-id attribute on all 4M paragraphs, and added the attribute range index. Unfortunately I then got an exception that no element-attribute range index exists for the given element/attribute QNames. I couldn't find anything wrong with my settings and localnames/namespaces. I assume that the problem was caused by messing with the reindexing settings while refragmenting/reindexing. Is that possible? I've now removed the index and am waiting for the reindexing to complete. After that I will add the index again.

> I also don't think you need to limit to a specific language, but that
> shouldn't slow things down if you want to use it

The query trace showed that the extra predicate needed to be filtered, while the rest of the XPath could be resolved from the indexes. I had the feeling that removing it resulted in better performance, but I've not done any thorough testing and I had made other changes as well. I will let you know when I have the final results.

Kind regards,

Laurens van den Oever
Xopus BV
http://xopus.com
+31 70 4452345
KvK 27301795
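As a concrete illustration of the bulk update Laurens describes (stamping every paragraph with a hash-id attribute derived from xdmp:md5, per Kelly's suggestion quoted further down in this digest), a minimal sketch might look like the following. The element names follow the thread; the single-pass form is illustrative only, since updating roughly 4M paragraphs would in practice be split into many smaller transactions:

  (: sketch only: hash each paragraph's string value and store it as a
     hash-id attribute; a real run would be batched across transactions :)
  for $p in /manual/translation//paragraph[fn:not(@hash-id)]
  return xdmp:node-insert-child(
    $p,
    attribute hash-id { xdmp:md5(fn:string($p)) }
  )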
------------------------------

Date: Mon, 27 Jul 2009 10:34:34 -0700
From: Kelly Stirman <[email protected]>
Subject: [MarkLogic Dev General] RE: Sorting by the number of occurrences of a paragraph
To: "[email protected]" <[email protected]>

Hi Laurens,

If I follow your design correctly, what I would do is the following:

1) iterate over all your paragraphs and use xdmp:md5() to generate a hash value
2) add this hash value as an attribute to each paragraph, e.g. <paragraph hash-id="abc123">hello world</paragraph>
3) create a string range index in the codepoint collation on the paragraph/@hash-id attribute

Then, to return paragraphs in frequency order, you can call cts:element-attribute-values(xs:QName("paragraph"), xs:QName("hash-id"), (), "item-frequency"). You can filter this list with any search expression by passing a cts:query as a further argument (see below). This approach allows you to quickly get the hash-ids in frequency order, with or without a cts:query. You'll then need to go get a paragraph that matches each hash-id. Because there may be many, you can simply grab the first.

  let $q := cts:element-word-query(xs:QName("paragraph"), "search phrase")
  for $id in cts:element-attribute-values(
    xs:QName("paragraph"), xs:QName("hash-id"),
    (), "item-frequency", $q)
  return element result {
    attribute count { cts:frequency($id) },
    (//paragraph[@hash-id eq $id])[1]
  }

Finally, before doing any of this, I would get rid of your fragmentation. You probably don't need fields, but we can continue to talk about how they might be useful for this task. I also don't think you need to limit to a specific language, but that shouldn't slow things down if you want to use it (be sure to look over our developer guide on using languages, and your server license *may* come into play on this subject).

This should be very fast - well under a second as long as there aren't too many paragraphs being returned. Getting the hash-ids will be resolved out of the indexes, whereas each paragraph returned will incur a disk I/O. 100 or so results should be sub-second.

Kelly

------------------------------

Message: 4
Date: Mon, 27 Jul 2009 16:11:16 +0200
From: Laurens van den Oever <[email protected]>
Subject: [MarkLogic Dev General] Sorting by the number of occurrences of a paragraph
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

Hi all,

I'm pretty new to MarkLogic, so chances are that I've made some trivial mistake here. I have roughly the following structure:

  <manual>
    <translation lang="..."> <!-- no xml:lang due to legacy -->
      <!-- arbitrary nesting of other elements -->
      <paragraph>

I have about 5000 manuals with on average 16 translations each, bringing the total of distinct (!) paragraphs to 700000.
The goal is to stimulate content reuse from the authoring interface. I want to show the authors about 10 paragraphs which contain a search phrase and, here it comes, ordered by the number of occurrences of that paragraph in the collection. I assume that a distinct paragraph only occurs once in a translation.

I realize that I'm trying to achieve something close to impossible: expecting fast results from a query that compares a large part of the database against the whole database. But I'm amazed that I've come this far and I'd like to see if I can get this to the next level.

I started with the following query:

  (for $para in cts:search(//paragraph,
      cts:element-word-query(xs:QName("paragraph"), "search phrase"))
   let $count := xdmp:estimate(cts:search(//paragraph,
      cts:element-word-query(xs:QName("paragraph"), $para)))
   order by number($count) descending
   return <result count="{$count}">{$para}</result>
  )[1 to 10]

There are two problems with this approach:

1. it is far too slow
2. it returns multiple occurrences of the same content

I've been able to improve performance with the following measures:

- Maximizing the number of initial search results.
- Refragmenting the database at <translation/> level.
- Making <paragraph/> the root of a field.
- Reducing the scope of the query to one language using a [@lang="EN"] predicate, but that slowed things down.
- Simple scoring improved performance and accuracy, as relevance seems to contradict my quest for the most occurrences.

To eliminate the multiple occurrences I've used fn:distinct-values, but the downside is that it returns a string and I need the paragraph element including all markup. Now my new query is:

  (for $p in fn:distinct-values(
      cts:search(
        /manual/translation//paragraph,
        cts:field-word-query("paragraph", "search query"),
        ("score-simple"))[1 to 250])
   let $count := xdmp:estimate(
      cts:search(
        /manual/translation//paragraph,
        cts:field-word-query("paragraph", $p),
        ("score-simple")))
   order by number($count) descending
   return <result count="{$count}">{$p}</result>
  )[1 to 10]

This is often very fast, but can take far too long if I happen to hit a batch of documents/fragments that weren't hit recently. Is there more I can do here? Or is there a completely different approach that may yield better results? And how do I get mixed content results?

Thanks for reading through all this!

Kind regards,

Laurens van den Oever
Xopus BV
http://xopus.com
+31 70 4452345
KvK 27301795

------------------------------

Message: 2
Date: Wed, 29 Jul 2009 14:08:52 +0100
From: "Baranov, Ivan - Moscow" <[email protected]>
Subject: RE: [MarkLogic Dev General] PDF conversion trial
To: General Mark Logic Developer Discussion <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset="utf-8"

Thank you for your advice David, I'm trying this also for sure!

Van

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of David Sewell
Sent: Tuesday, July 28, 2009 5:37 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] PDF conversion trial

It's worth comparing ML's PDF-to-XML (and XHTML) conversion against the export facility in Adobe Acrobat 9, if you have it. I've recently been evaluating the two. Neither is perfect, and they differ in exactly where their strengths and weaknesses are.
It is very difficult to get letter-perfect XML/XHTML conversion from PDF if the source is complex, because the underlying PDF data has all sorts of font changes, typographic features, and other things that cause "interference" in the output. For example, in converting the PDF from a typeset book containing wide angle brackets (U+2329 / U+232A or similar), the Acrobat export consistently captured them with styled <span>s, while the MarkLogic export sometimes captured them and sometimes dropped them or substituted '( )'. On the other hand, MarkLogic normalized the ligature "ﬁ" correctly as "fi", whereas Acrobat inserts an extra space, "fi ", for no good reason.

MarkLogic's PDF conversion pipelines give you more options over how the output will be structured than Acrobat does.

DS

On Tue, 28 Jul 2009, Baranov, Ivan - Moscow wrote:
> Hi All
>
> I've recently tried to convert PDF to XML using the built-in function
> xdmp:pdf-convert() and discovered that my company's license does not
> allow this. Actually I have my own converter, so I just wanted to try
> whether ML does it better or faster, and now I'm curious: is there any
> way to acquire such functionality on a trial basis?
> Thanks,
> Van

--
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 801079, Charlottesville, VA 22904-4318 USA
Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
Email: [email protected]
Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/
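For readers whose license does enable conversion, a minimal, hypothetical sketch of the call Ivan mentions might look like the following; the document URI and filename are made up for illustration:

  (: sketch only: convert a PDF already loaded in the database; the call
     returns the converted output plus any auxiliary parts :)
  xdmp:pdf-convert(fn:doc("/samples/manual.pdf"), "manual.pdf")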
------------------------------

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general

End of General Digest, Vol 61, Issue 41
***************************************
