I think the key bit is here:

“MarkLogic indexes work at the fragment/document level.  So doing a reverse 
query 20 times against different subparts of a document is going to involve 
brute force filtering to see if the match was in the needed part or not.”

That suggests that our general approach of running reverse queries against 
subparts of documents is flawed, which would explain the apparent poor 
performance.

It’s not possible to break the current docs into smaller docs, but it might be 
possible to configure fragmentation at a level where each fragment would 
contain only one element we need to match on (e.g., titles).

Another question: having gotten a result from a reverse search at the 
full-document level, is there a way to know *which* queries matched? If so, it 
would be easy enough to apply those queries to the relevant elements as an 
additional filtering pass (although I suppose that might get us back to the 
same place).
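For what it’s worth, here’s a toy sketch of that two-pass idea in plain Python 
(not the MarkLogic API; all names and queries here are made up for 
illustration): match once against the whole document, then re-run only the 
queries that hit against the individual elements.

```python
# Toy model of two-pass reverse-query matching (illustration only; a
# "query" here is just a named predicate over text, not a cts:query).

def matching_queries(queries, text):
    """Return the names of the stored queries that match the text."""
    return [name for name, pred in queries.items() if pred(text)]

# Hypothetical stored queries, keyed by name.
queries = {
    "mentions-title": lambda t: "title" in t.lower(),
    "mentions-price": lambda t: "price" in t.lower(),
}

# Pass 1: match once against the whole document's text.
document = {"title-element": "The Title", "body": "Some body text"}
whole_doc_text = " ".join(document.values())
hits = matching_queries(queries, whole_doc_text)   # ["mentions-title"]

# Pass 2: only for the queries that hit, find which elements matched.
refined = {
    name: [elem for elem, text in document.items() if queries[name](text)]
    for name in hits
}
# refined == {"mentions-title": ["title-element"]}
```

The point is that pass 2 runs only the handful of queries that already proved 
a document-level hit, rather than all 125,000 against every element.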

Unfortunately, my current performance metrics amount to “it takes way too long 
now and needs to take no more than half as long.” I need to do more work to 
get some useful measurements and do some calculations to determine what 
reasonable performance should be (e.g., we have X million cases to check; at 
100 ms per case it should take about Y time, but it takes Y*n time; why?).

Ultimately I need to determine how fast this type of operation *should* be. If 
I can establish that, I can determine whether the throughput requirements can 
be met simply by achieving that performance with the current server 
configuration, or whether they cannot and we need to scale up, e.g., by adding 
additional D-nodes.

I realize that nobody can offer me solid numbers based on what little I can 
share about the project details, other than to suggest some bounds.

In particular, if I have 125,000 reverse queries applied to a single document 
(assuming that total database volume doesn’t affect query speed in this case) 
on a modern, fast server with appropriate indexes in place, how long should I 
expect that query to take? 1 ms? 10 ms? 100 ms? 1 second?

Based on my experience with ML and the documentation I would expect something 
around 10ms.

Our corpus has about 25 million elements that would become fragments per the 
advice above (about 1.5 million full documents).

If we assume 10 ms per query pass per fragment, it would take about 3 days to 
process all of them. Currently it takes 9, so roughly a 3x slowdown relative 
to what I think we could expect, +/- 1 day (there’s other overhead in this 
9-day number that may or may not be reducible).
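As a sanity check on that arithmetic (figures from this thread; “day” here 
means 24 hours of wall-clock processing):

```python
# Back-of-envelope check of the 10 ms/fragment estimate against the
# observed 9-day runtime (figures from this thread).
fragments = 25_000_000       # ~25 million fragments in the corpus
per_fragment_s = 0.010       # assumed 10 ms per reverse-query pass
seconds_per_day = 86_400

expected_days = fragments * per_fragment_s / seconds_per_day
observed_days = 9

print(round(expected_days, 1))                   # 2.9 days expected
print(round(observed_days / expected_days, 1))   # 3.1x slower than expected
```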

I’ve never done much with fragments in MarkLogic, so I’m not sure what the 
full implications of making these subelements into fragments would be for 
other processing.

Cheers,

Eliot
--
Eliot Kimber
http://contrext.com
 


On 5/1/17, 9:43 PM, "Jason Hunter" <[email protected] on 
behalf of [email protected]> wrote:

    So what's the performance you're seeing?
    
    And what do you expect to be able to see?
    
    Something to consider:  MarkLogic indexes work at the fragment/document 
level.  So doing a reverse query 20 times against different subparts of a 
document is going to involve brute force filtering to see if the match was in 
the needed part or not.  Might be better to have 20 documents instead of 1.
    
    -jh-
    
    > On May 2, 2017, at 01:29, Eliot Kimber <[email protected]> wrote:
    > 
    > Actually, it’s expected that every element will be matched by at least one 
query. This is a classification application and the intent of the application 
is that every element of interest will be classified. Many, if not most, of the 
queries depend on word-search features, e.g., stemmed matches, case 
insensitivity, etc. 
    > 
    > I’m new to this project so it may be that there is a better way to 
approach the problem in general. This is the system as currently implemented.
    > 
    > My overall charge is to improve the throughput performance so my first 
task is to first understand what the performance bottlenecks are then identify 
possible solutions.
    > 
    > It seems unlikely that we’ve done something silly in our queries or ML 
configuration but I want to eliminate the easy-to-fix before exploring more 
complicated options. 
    > 
    > Cheers,
    > 
    > Eliot
    > 
    > --
    > Eliot Kimber
    > http://contrext.com
    > 
    > 
    > 
    > On 5/1/17, 12:10 PM, "Jason Hunter" 
<[email protected] on behalf of 
[email protected]> wrote:
    > 
    >> The processing is, for each document to be processed, examine on the 
order of 10-20 elements to see if they match the reverse query by getting the 
node to be looked up and then doing:
    > 
    >    Maybe you can reverse query on the document as a whole instead of 
running 20 reverse queries per document.  Only bother with the enumeration of 
the 20 if there's a proven hit within the document.
    > 
    >    (I assume the vast majority of the time there's not going to be hits.  
If that's true then why not prove that in one pop instead of 20 pops.)
    > 
    >    -jh-
    > 
    >    _______________________________________________
    >    General mailing list
    >    [email protected]
    >    Manage your subscription at: 
    >    http://developer.marklogic.com/mailman/listinfo/general
    > 
    > 
    > 
    > 
    

