Re: [MarkLogic Dev General] How to give hints to MarkLogic on which condition is faster to check first?

Jason Hunter Sat, 21 Apr 2012 22:15:32 -0700

Term lists aren't generated during the query.  They're generated as part of 
document storage, which is when indexing happens.  At query time they're 
pulled off disk or from the list cache, then intersected with the other term 
lists you're and'ing against.
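
As a minimal sketch of what that looks like for the word-count case quoted 
below: each constraint resolves to its own list of document ids from the 
indexes, and the and-query intersects them.  This assumes an xs:int element 
range index on num-words, as described in the quoted message; the word list 
is only illustrative.

```xquery
xquery version "1.0-ml";

(: Illustrative list: the distinct words the target document contains. :)
let $words := ("boat", "alligator", "bandit", "flower")
return
  cts:search(fn:doc(),
    cts:and-query((
      (: Resolved from the range index on num-words (assumed to exist). :)
      cts:element-range-query(xs:QName("num-words"), "=", xs:int(5)),
      (: One word constraint per word, each resolved from its term list. :)
      for $w in $words
      return cts:element-word-query(xs:QName("words"), $w)
    )))
```

The order of the constraints inside the and-query doesn't matter.  Note that 
this only finds candidates: the word constraints say nothing about how many 
times a word occurs.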


If the second term list is a subset of the first, then the first one adds 
nothing to the intersection, so just leave the redundant constraint out of 
the and-query.  MarkLogic can't know that; the query author has to do it.  
MarkLogic isn't going to silently ignore one of your and-query constraints.
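
One rough way for a query author to check whether a constraint is pulling 
its weight is to compare index estimates with and without it.  This is only 
a sketch: the collection name and the word are made up, and equal estimates 
only tell you about the data loaded right now, so treat it as a sanity check 
rather than a proof.

```xquery
xquery version "1.0-ml";

(: Hypothetical constraints, for illustration only. :)
let $a := cts:collection-query("messages")
let $b := cts:word-query("alligator")
return
  (: If the two numbers are equal, $a is not narrowing the results of $b
     for the documents currently in the database. :)
  (
    xdmp:estimate(cts:search(fn:doc(), $b)),
    xdmp:estimate(cts:search(fn:doc(), cts:and-query(($a, $b))))
  )
```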

Accessing very long term lists does take a little bit of time, mostly to read 
them off disk the first time, then a little bit more to do the intersection, 
but that's on the order of log2(n), so pretty fast.  I did find that if you 
have a collection that matches 99% of your documents (such as, in MarkMail, 
the collection of messages to search), it's faster to have another collection 
for the ones not to search (the spam) that you exclude.  It's faster to do an 
and-not-query than an and-query because it keeps "n" much smaller.
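
The exclusion approach can be sketched like this.  The "spam" collection 
name and the search word are hypothetical, not MarkMail's actual code; the 
point is just that the negative side of the and-not-query is the short list.

```xquery
xquery version "1.0-ml";

(: Hypothetical collection name: "spam" holds the small set of messages
   that should never be searched. :)
cts:search(fn:doc(),
  cts:and-not-query(
    (: Positive side: what the user is actually searching for. :)
    cts:word-query("alligator"),
    (: Negative side: the short list of documents to exclude. :)
    cts:collection-query("spam")
  ))
```

The alternative would be an and-query against a collection holding the 99% 
of messages that are searchable, which means intersecting with a very long 
list on every search.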

There's lots of good discussion about indexing in 
http://community.marklogic.com/inside-marklogic.  (And the author is very good 
looking.)

-jh-

On Apr 20, 2012, at 10:45 PM, seme...@hotmail.com wrote:

> But doesn't it still take time to generate the second term list (the 
> expensive one)? And if the second term list (expensive) will only be a subset 
> of the first term list (inexpensive) wouldn't there be performance gains to 
> not intersect the two lists that were created independently (from the entire 
> index) but rather create the second list out of the first list?
> 
> Something somewhere has got to be checking values against each other and it 
> seems like the fewer checks that need to happen the faster things will go. 
> But maybe not.
> 
> -Ryan
> 
> From: jhun...@marklogic.com
> Date: Fri, 20 Apr 2012 22:37:03 -0700
> To: general@developer.marklogic.com
> Subject: Re: [MarkLogic Dev General] How to give hints to MarkLogic on which  
> condition is faster to check first?
> 
> Bottom line: For your word count use-case you should get the performance you 
> want if you can write it as a cts:search of a cts:and-query.
> 
> But it doesn't quite work like you're thinking.  It's not ordered execution 
> like it would be in Java.  It doesn't short-circuit.  To run a cts:search 
> there's no looking inside documents (at least not til the final filtering 
> phase to verify results).  It's all just term list set arithmetic.
> 
> In your example, the num-words constraint will quickly determine the 
> documents with the right number of words, and that list of document ids taken 
> from the index will be intersected with the document ids matching the other 
> constraint(s) based on indexes.  If the other constraints aren't selective (a 
> better word to use here than expensive) then it's OK because the first 
> constraint is highly selective.  Any documents with phrases that match the 
> second constraint but not the first (wrong # of words) won't be included 
> because they don't intersect.
> 
> Intersecting against long term lists is efficient so there's no need for 
> MarkLogic to short circuit.  Your subqueries are resolved to the extent 
> possible by the indexes.
> 
> -jh-
> 
> On Apr 20, 2012, at 8:54 PM, seme...@hotmail.com wrote:
> 
> So could I do:
> 
> cts:search(/,
>     cts:and-query(
>         cts:inexpensive-query...
>         ,
>         cts:expensive-query...
>     )
> )
> 
> and MarkLogic will check the first condition (cts:inexpensive-query) first 
> and only check the second condition if the first is true?
> 
> 
> CC: general@developer.marklogic.com
> From: m...@blakeley.com
> Date: Fri, 20 Apr 2012 19:49:17 -0700
> To: general@developer.marklogic.com
> Subject: Re: [MarkLogic Dev General] How to give hints to MarkLogic on which  
> condition is faster to check first?
> 
> Yes, boolean ops will short-circuit. You can test this for yourself using 
> xdmp:sleep and xdmp:elapsed-time.
> 
> -- Mike
> 
> On Apr 20, 2012, at 19:15, "seme...@hotmail.com" <seme...@hotmail.com> wrote:
> 
> I may have some queries where the comparison is expensive. So what I'd like 
> to do is add an extra element in each doc which is a "shortcut" to check for 
> first before doing the expensive comparison.
> 
> For example, suppose I had data that had random words ("boat", 
> "alligator", "house") in an element called "words" and there was a random 
> number of words in the element (say, 5 to 50). Other documents may have the 
> same words but in a different order. I want to find the documents that have 
> the same number of words and the same words regardless of the order.
> 
> So 
> 
> 1: <words>boat alligator house bandit flower</words>
> 
> and 
> 2: <words>bandit alligator bandit flower boat</words>
> would be a match
> 
> but 
> 3: <words>bandit alligator bandit flower boat island</words>
> would not be a match because it has an extra word
> 
> I thought that I can add a new element to each doc which represents the 
> number of words (5, 6, 11, 20, etc) and I can first check that the doc has 
> the right number of words before I check to see if it has the same words. I 
> am thinking the extra check on the number of words would shortcut the query 
> to not even bother checking individual words if the number of words doesn't 
> match and save me some time. The time may add up if I have millions or tens 
> of millions of docs to query against.
> 
> So if my thinking is correct, then I would have documents that look like this:
> 
> <doc>
>     <words>bandit alligator bandit flower boat</words>
>     <num-words>5</num-words>
> </doc>
> 
> I could put a range index on the "num-words" element of type xs:int.
> 
> Then I'd like to write queries so that the num-words condition is checked 
> first by the magical MarkLogic engine and only if that first condition is met 
> would it check the rest.
> 
> I know in Java that the runtime environment won't check the second condition 
> if the first is false in a boolean statement. So:
> 
> if (1 == 0 && explode()) { ....
> 
> "explode()" will never be called because the first condition in the statement 
> is false. But the order is important; "1 == 0" must be before "explode()" in 
> the statement because that statement will be evaluated from left to right.
> 
> I don't know if XQuery or MarkLogic works that way (didn't see anything in 
> the spec) and I know that MarkLogic has all sorts of optimizations, but how 
> will it know that it's faster to check the "num-words" condition before the 
> individual words? Can I write a cts:query that gives a hint to MarkLogic to 
> give precedence to one condition over another to save time? *I* know that the 
> num-words check is faster but how can *MarkLogic* know that? 
> 
> I suppose it could be argued that it doesn't really matter because MarkLogic 
> runs fast anyway, but I'm talking about long running queries over massive 
> data sets so even small amounts of time are important to me.
> 
> Thanks!
> 
> -Ryan
> _______________________________________________
> General mailing list
> General@developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general
> 
