[MarkLogic Dev General] Getting Impossible Value from count()--why?

Eliot Kimber Tue, 22 Aug 2017 12:52:53 -0700

Using ML 8.0-3.2

As part of my profiling application I run a large number of profiles, storing 
the profiler results back to the database. I’m then extracting the times from 
the profiling data to create histograms and do other analysis.


My first attempt to do this with buckets ran into the problem that the 
index-based buckets were not returning accurate numbers, so I reimplemented it 
to construct the buckets manually from a list of the actual duration values.

My code is:

let $durations as xs:dayTimeDuration* :=
    epf:get-enrichment-durations($collection)
let $search-range := epf:construct-search-range()
let $facets :=
    for $bucket in $search-range/search:bucket
    let $upper-bound :=
        if ($bucket/@lt)
        then xs:dayTimeDuration($bucket/@lt)
        else xs:dayTimeDuration("PT0S")
    let $lower-bound := xs:dayTimeDuration($bucket/@ge)
    let $count := count($durations[. lt $upper-bound][. ge $lower-bound])
    return
        if ($count gt 0)
        then <search:facet-value name="bucket-{$upper-bound}" count="{$count}">{
                 epf:format-day-time-duration($upper-bound)
             }</search:facet-value>
        else ()

The get-enrichment-durations() function does this:

  cts:element-values(xs:QName("prof:overall-elapsed"), (), "descending",
                     cts:collection-query($collection))

This works nicely and seems to provide correct numbers except when the number 
of durations within a particular set of bounds exceeds 10,000, at which point 
count() returns exactly 10,000. That is an implausible number: the chance of 
there being exactly 10,000 instances within a given range is basically zero. 
And I’m getting 10,000 for two different ranges, which cannot be a coincidence.

Here are the results I get from running this in the query console:

<result>
<count-durations>75778</count-durations>
<facets>
<search:facet-value name="bucket-PT0.01S" count="3" xmlns:search="http://marklogic.com/appservices/search">0.01 seconds</search:facet-value>
<search:facet-value name="bucket-PT0.02S" count="7280" xmlns:search="http://marklogic.com/appservices/search">0.02 seconds</search:facet-value>
<search:facet-value name="bucket-PT0.03S" count="10000" xmlns:search="http://marklogic.com/appservices/search">0.03 seconds</search:facet-value>
<search:facet-value name="bucket-PT0.04S" count="10000" xmlns:search="http://marklogic.com/appservices/search">0.04 seconds</search:facet-value>
<search:facet-value name="bucket-PT0.05S" count="9984" xmlns:search="http://marklogic.com/appservices/search">0.05 seconds</search:facet-value>
 …
</facets>
</result>

There are 75,778 actual duration values, and the count values for the 3rd and 
4th ranges are both exactly 10,000.

If I change the let $count := expression to only test the upper or lower bound 
then I get numbers greater than 10,000. I also tried changing the order of the 
predicates and using a single predicate with “and”. The problem only seems to 
be related to using both predicates when the resulting sequence would have more 
than 10K items.
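
For reference, the predicate variants described above can be sketched roughly 
as follows. This is only an illustration under stated assumptions: the 
durations are hypothetical stand-in data built in memory (not the values 
returned by cts:element-values()), the bounds are made up, and the FLWOR form 
is an extra variant that is not among the ones tried above. By standard XQuery 
semantics all four expressions are equivalent, and for this stand-in data each 
should count 15,000 items (5,000 through 19,999 seconds).

```xquery
(: Hypothetical stand-in data: 25,000 durations, PT1S through PT25000S.
   The real code gets its durations from cts:element-values(). :)
let $durations as xs:dayTimeDuration* :=
    for $i in 1 to 25000
    return xs:dayTimeDuration(concat("PT", $i, "S"))
(: Made-up bounds chosen so that more than 10,000 items fall in range :)
let $lower-bound := xs:dayTimeDuration("PT5000S")
let $upper-bound := xs:dayTimeDuration("PT20000S")
return
    <counts>
        <!-- Two chained predicates, as in the original code -->
        <chained>{count($durations[. lt $upper-bound][. ge $lower-bound])}</chained>
        <!-- Same predicates in the opposite order -->
        <reordered>{count($durations[. ge $lower-bound][. lt $upper-bound])}</reordered>
        <!-- A single predicate combining both tests with "and" -->
        <single-and>{count($durations[. ge $lower-bound and . lt $upper-bound])}</single-and>
        <!-- Explicit FLWOR filter (an additional form, not tried above) -->
        <flwor>{
            count(
                for $d in $durations
                where $d ge $lower-bound and $d lt $upper-bound
                return $d
            )
        }</flwor>
    </counts>
```

Because the stand-in sequence is not a lexicon result, it may not reproduce the 
10,000 cap; replacing it with the real epf:get-enrichment-durations() call 
would show whether any of the forms behaves differently on the actual data.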

Is there an explanation for why count() gives me exactly 10,000 in this case?

Is there a workaround for this behavior?

The search range I’m constructing is normal ML-defined markup for defining a 
search range, e.g.:

<search:range type="xs:dayTimeDuration" facet="true" xmlns:search="http://marklogic.com/appservices/search">
<search:bucket ge="PT0.000S" lt="PT0.001S" name="100th">0.001 Second</search:bucket>
<search:bucket ge="PT0.001S" lt="PT0.002S" name="200th">0.002 Second</search:bucket>
<search:bucket ge="PT0.002S" lt="PT0.003S" name="300th">0.003 Second</search:bucket>
<search:bucket ge="PT0.003S" lt="PT0.004S" name="400th">0.004 Second</search:bucket>
<search:bucket ge="PT0.004S" lt="PT0.005S" name="500th">0.005 Second</search:bucket>
…
</search:range>

Thanks,

Eliot
--
Eliot Kimber
http://contrext.com
 



