[MarkLogic Dev General] Re: sorting by the number of occurences of a paragraph

Kelly Stirman Wed, 29 Jul 2009 14:32:20 -0700

Laurens,

I didn't see your error, but I'm guessing you did not pass the collation as an 
option in your call to cts:element-attribute-values(). The namespace, 
localname, and collation in the call must all match the range index.
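
A minimal sketch of passing the collation explicitly to the attribute lexicon call. It assumes the range index is a string index on paragraph/@hash-id with no namespace and the codepoint collation, as suggested elsewhere in this thread; the QNames and the collation URI should be adjusted to whatever the index configuration actually says.

```xquery
xquery version "1.0-ml";

(: Assumption: a string attribute range index exists on
   paragraph/@hash-id, no namespace, codepoint collation. :)
cts:element-attribute-values(
  xs:QName("paragraph"),
  xs:QName("hash-id"),
  (),
  ("item-frequency",
   "collation=http://marklogic.com/collation/codepoint"))
```

If the collation named in the option differs from the one configured on the index, the call fails as if no index existed, even when the element and attribute names are correct.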

I should have said to create the range index first, then add the hash-id 
attribute. That would result in a single reindexing of your content (adding a 
new index will not cause documents to be reindexed unless there are actually 
things that need to be reindexed). However, it seemed easier to explain in the 
order I explained it. :-)

As for your reading of the query trace, in general the time to evaluate a query 
in MarkLogic is proportional to the number of results, not the complexity of 
the query. Having more constraints oftentimes results in faster queries. 

Kelly 

Message: 1
Date: Wed, 29 Jul 2009 14:29:01 +0200
From: Laurens van den Oever <[email protected]>
Subject: [MarkLogic Dev General] RE: Sorting by the number of
occurences of a paragraph
To: general <[email protected]>
Message-ID:
<[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

Hi Kelly,

Thank you for your excellent response. Your solution seems to do exactly
what I need.

I have removed my fragmentation and field, set the hash-id attribute on all
4M paragraphs and added the attribute range index.
Unfortunately I then got an exception that no element-attribute range index
exists for the given element/attribute QNames.
I couldn't find anything wrong with my settings and localnames/namespaces.
I assume that the problem was caused by messing with the reindexing settings
while refragmenting/reindexing. Is that possible?

I've now removed the index and am waiting for the reindexing to complete.
After that I will add the index again.

>  I also don't think you need to limit to a specific language, but that
shouldn't slow things down if you want to use it

The query-trace showed that the extra predicate needed to be filtered while
the rest of the XPath could be resolved from the indexes.
I had the feeling that removing it resulted in better performance, but I've
not done any thorough testing and I had made other changes as well.

I will let you know when I have the final results.

Kind regards,

Laurens van den Oever
Xopus BV

http://xopus.com
+31 70 4452345
KvK 27301795

----- Original Message -----
From: [email protected] 
<[email protected]>
To: [email protected] <[email protected]>
Sent: Wed Jul 29 12:00:07 2009
Subject: General Digest, Vol 61, Issue 41

Today's Topics:

   1. RE: Sorting by the number of occurences of a paragraph
      (Laurens van den Oever)
   2. RE: PDF conversion trial (Baranov, Ivan - Moscow)

----------------------------------------------------------------------

Date: Mon, 27 Jul 2009 10:34:34 -0700
From: Kelly Stirman <[email protected]>
Subject: [MarkLogic Dev General] RE: Sorting by the number of
       occurences of   a paragraph
To: "[email protected]"
       <[email protected]>

Hi Laurens,

If I follow your design correctly, what I would do is the following:

1) iterate over all your paragraphs and use xdmp:md5() to generate a hash
value
2) add this hash value as an attribute to each paragraph, e.g. <paragraph
hash-id="abc123">hello world</paragraph>
3) create a string range index in the codepoint collation on the
paragraph/@hash-id attribute
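
Steps 1 and 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: paragraphs are in no namespace, the hash is computed over the paragraph's string value (so inline markup is ignored), and the batch size of 1000 is an arbitrary number chosen to keep each transaction small.

```xquery
xquery version "1.0-ml";

(: Assumptions: no namespace on <paragraph>, hash over the string value,
   arbitrary batch size. Rerun until no paragraph lacks a hash-id. :)
for $p in (fn:collection()//paragraph[fn:not(@hash-id)])[1 to 1000]
return
  xdmp:node-insert-child(
    $p,
    (: passing an attribute node adds it as an attribute of $p :)
    attribute hash-id { xdmp:md5(fn:string($p)) })
```

Hashing fn:string($p) treats two paragraphs with the same text but different inline markup as identical. If markup differences should count, hashing xdmp:quote($p) instead is one option.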

Then to return paragraph hash-ids in frequency order, you can call
cts:element-attribute-values(xs:QName("paragraph"),xs:QName("hash-id"),(),("item-frequency","frequency-order")).
You can filter this list with any search expression by passing a cts:query
as an additional argument (see below).

This approach allows you to quickly get the hash-id in frequency order, with
or without a cts:query. You'll then need to go get a paragraph that matches
the hash-id. Because there may be many, you can simply grab the first.

let $q := cts:word-query("search phrase")
for $id in
cts:element-attribute-values(xs:QName("paragraph"),xs:QName("hash-id"),(),
("item-frequency","frequency-order"),$q)
return element result {attribute count
{cts:frequency($id)},(//paragraph[@hash-id eq $id])[1]}

Finally, before doing any of this, I would get rid of your fragmentation.
You probably don't need fields, but we can continue to talk about how they
might be useful for this task. I also don't think you need to limit to a
specific language, but that shouldn't slow things down if you want to use it
(be sure to look over our developer guide on using languages, and your
server license *may* come into play on this subject).

This should be very fast - well under a second as long as there aren't too
many paragraphs being returned. Getting the hash-ids will be resolved out of
the indexes, whereas each paragraph returned will incur a disk i/o. 100 or
so results should be sub-second.
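
Since each returned paragraph costs a disk read, one way to keep the cost bounded is to fetch paragraphs only for the first few hash-ids. The sketch below assumes the same paragraph/@hash-id range index as above and a top-10 cutoff, which is the number of suggestions mentioned in the original question, not a recommended value.

```xquery
xquery version "1.0-ml";

(: Assumptions: string range index on paragraph/@hash-id, no namespace;
   the cutoff of 10 is illustrative. :)
let $q := cts:word-query("search phrase")
let $ids :=
  cts:element-attribute-values(
    xs:QName("paragraph"), xs:QName("hash-id"), (),
    ("item-frequency", "frequency-order", "descending"),
    $q)
for $id in $ids[1 to 10]
return
  element result {
    attribute count { cts:frequency($id) },
    (: only these ten lookups touch the stored documents :)
    (//paragraph[@hash-id eq $id])[1]
  }
```

A cts:query passed to a lexicon call selects fragments rather than individual elements, so the returned hash-ids can include other paragraphs from the same matching fragment. This sketch does not address that.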

Kelly

Message: 4
Date: Mon, 27 Jul 2009 16:11:16 +0200
From: Laurens van den Oever <[email protected]>
Subject: [MarkLogic Dev General] Sorting by the number of occurences
       of a    paragraph
To: [email protected]
Message-ID:
       <[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

Hi all,
I'm pretty new to MarkLogic, so chances are that I've made some trivial
mistake here.

I have roughly the following structure:

<manual>
 <translation lang="..."><!-- no xml:lang due to legacy -->
   <!-- arbitrary nesting of other elements -->
     <paragraph>

I have about 5000 manuals with on average 16 translations each, bringing the
total of distinct (!) paragraphs to 700000.
The goal is to stimulate content reuse from the authoring interface.
I want to show the authors about 10 paragraphs which contain a search phrase
and here it comes: ordered by the number of occurrences of that paragraph in
the collection.
I assume that a distinct paragraph only occurs once in a translation.

I realize that I'm trying to achieve something close to impossible;
expecting fast results from a query that compares a large part of the db
against the whole db, but I'm amazed that I've come this far and I'd like to
see if I can get this to the next level.

I started with the following query:

 (for $para in cts:search(//paragraph,
cts:element-word-query(xs:QName("paragraph"), "search phrase"))
 let $count := xdmp:estimate(cts:search(//paragraph,
cts:element-word-query(xs:QName("paragraph"), $para)))
 order by number($count) descending
 return
 <result count="{$count}">
   {$para}
 </result>
 )[1 to 10]

There are two problems with this approach:
1. it is far too slow
2. it returns multiple occurrences of the same content

I've been able to improve performance with the following measures:
- Maximizing the number of initial search results.
- Refragmenting the database on <translation/> level.
- Made <paragraph/> the root of a field.
- Reduced the scope of the query to one language using a [...@lang="EN"]
predicate but that slowed things down.
- Simple scoring improved performance and accuracy as relevance seems to
contradict my quest for the most occurrences.

To eliminate the multiple occurrences I've used fn:distinct-values, but the
downside is that it returns a string and I need the paragraph element
including all markup.
Now my new query is:

 (for $p in fn:distinct-values(
   cts:search(
     /manual/translation//paragraph,
     cts:field-word-query("paragraph", "search query"),
     ("score-simple"))[1 to 250])
 let $count := xdmp:estimate(
   cts:search(
     /manual/translation//paragraph,
     cts:field-word-query("paragraph", $p),
     ("score-simple")))
 order by number($count) descending
 return <result count="{$count}">{$p}</result>
)[1 to 10]

This is often very fast, but can take far too long if I happen to hit a
batch of documents/fragments that weren't hit recently.

Is there more I can do here?
Or is there a completely different approach that may yield better results?
And how do I get mixed content results?

Thanks for reading through all this!

Kind regards,

Laurens van den Oever
Xopus BV

http://xopus.com
+31 70 4452345
KvK 27301795

------------------------------

Message: 2
Date: Wed, 29 Jul 2009 14:08:52 +0100
From: "Baranov, Ivan - Moscow" <[email protected]>
Subject: RE: [MarkLogic Dev General] PDF conversion trial
To: General Mark Logic Developer Discussion
        <[email protected]>
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset="utf-8"

Thank you for your advice David, I'm trying this also for sure!

Van

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of David Sewell
Sent: Tuesday, July 28, 2009 5:37 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] PDF conversion trial

It's worth comparing ML's PDF-to-XML (and XHTML) conversion against the export 
facility in Adobe Acrobat 9, if you have it. I've recently been evaluating the 
two. Neither is perfect, and they differ in exactly where their strengths and 
weaknesses are. It is very difficult to get letter-perfect XML/XHTML conversion 
from PDF, if the source is complex, because the underlying PDF data has all 
sorts of font changes, typographic features, and other things that cause 
"interference" in the output.

For example, in converting the PDF from a typeset book containing wide angle 
brackets (U+2329 / U+232A or similar), the Acrobat export consistently captured 
them with styled <span>s, while the MarkLogic export sometimes captured them 
and sometimes dropped them or substituted '( )'. On the other hand, MarkLogic 
normalized ligature "ﬁ" correctly as "fi", but Acrobat inserts an extra space, 
"fi " for no good reason.

MarkLogic's PDF conversion pipelines give you more options over how the output 
will be structured than Acrobat does.

DS

On Tue, 28 Jul 2009, Baranov, Ivan - Moscow wrote:

> Hi All
>
> I've recently tried to convert PDF to XML using the built-in function
> xdmp:pdf-convert() and discovered that my company's license does not 
> allow this.  Actually I have my own converter, so I just wanted to see 
> whether ML does it better or faster. Is there any way to acquire such 
> functionality on a trial basis?

> Thanks,
> Van
>

--
David Sewell, Editorial and Technical Manager ROTUNDA, The University of 
Virginia Press PO Box 801079, Charlottesville, VA 22904-4318 USA
Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
Email: [email protected]   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/

------------------------------

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general

End of General Digest, Vol 61, Issue 41
***************************************
