By “which query” I mean which of the 125,000 separate query docs actually 
matched for a given cts:reverse-query() call. 

I guess my question is: in the case where the reverse query is applied to an 
element that is not a full document, does the "brute force" matching have to be 
applied for every candidate query, or only for those queries that match the 
containing document of the input element? 

If the brute-force cost is applied to every query, then doing a two-phase 
search would be faster: first determine which reverse queries apply to the 
input document as a whole, then apply only those queries to find the elements 
within the input document that actually matched. But if the brute-force cost 
applies only to the queries that match the containing doc, then ML internally 
must already produce the result faster than I could in my own code. 

But as you say, that calls into question the use of reverse queries at all: 
why not simply run the 125,000 forward queries and update each matched element 
as appropriate?
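For comparison, the all-forward version would look roughly like this, again 
with hypothetical collection names and a hypothetical update function:

```xquery
(: Sketch only: run each stored query forward over the corpus
   and hand every matched element to an update function. :)
for $qdoc in fn:collection("reverse-queries")
let $query := cts:query($qdoc/*)
for $elem in cts:search(fn:collection("corpus")//section, $query)
return local:update-element($elem)  (: hypothetical updater :)
```

That is 125,000 search calls instead of one reverse-query call per document, 
which is the trade-off I'm trying to understand.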

Or it may simply be that we need to do some horizontal scaling and invest in 
additional D-nodes.

Cheers,

E.
--
Eliot Kimber
http://contrext.com
 


On 5/1/17, 10:26 PM, "Jason Hunter" <[email protected] on 
behalf of [email protected]> wrote:

    > Another question: having gotten a result from a reverse search at the 
full document level, is there a way to know *which* queries matched? If so then 
it would be easy enough to apply those queries to the relevant elements to do 
additional filtering (although I suppose that might get us back to the same 
place).
    
    I'm a little confused.  You're putting multiple serialized queries into 
each document?  If you have just one serialized query in a document it's going 
to be obvious which query was the reverse match -- it was that one.
    
    > In particular, if I have 125,000 reverse queries applied to a single 
document (assuming that total database volume doesn’t affect query speed in 
this case) on a modern fast server with appropriate indexes in place, how fast 
should I expect that query to take? 1ms?, 10ms?, 100ms? 1 second?
    
    If you have 125,000 documents each with a serialized query in it and you do 
a reverse query for one document against those serialized queries and there are 
no hits, it should be extremely fast.  More hits will slow things a little bit 
because hits involve a little work.  The IMLS paper explains what the algorithm 
has to do.  I suspect (but haven't measured) that it's a lot like forward 
queries in that the timing depends a lot on number of matches.
    
    > Our corpus has about 25 million elements that would be fragments per the 
advice above (about 1.5 million full documents). 
    
    If you have 25 million elements you want to run against 125,000 serialized 
queries, wouldn't forward queries be faster?  You'd only have to do 125,000 
search calls instead of 25,000,000.  :)
    
    > I’ve never done much with fragments in MarkLogic so I’m not sure what the 
full implication of making these subelements into fragments would be for other 
processing.
    
    Yeah, fragmentation is not to be done lightly.
    
    -jh-
    
    _______________________________________________
    General mailing list
    [email protected]
    Manage your subscription at: 
    http://developer.marklogic.com/mailman/listinfo/general
    


