I'm using MarkLogic 8.0-3.2.
As part of my profiling application I run a large number of profiles, storing
the profiler results back to the database. I’m then extracting the times from
the profiling data to create histograms and do other analysis.
My first attempt to do this with buckets ran into the problem that the
index-based buckets were not returning accurate numbers, so I reimplemented it
to construct the buckets manually from a list of the actual duration values.
My code is:
let $durations as xs:dayTimeDuration* :=
  epf:get-enrichment-durations($collection)
let $search-range := epf:construct-search-range()
let $facets :=
  for $bucket in $search-range/search:bucket
  let $upper-bound :=
    if ($bucket/@lt)
    then xs:dayTimeDuration($bucket/@lt)
    else xs:dayTimeDuration("PT0S")
  let $lower-bound := xs:dayTimeDuration($bucket/@ge)
  let $count := count($durations[. lt $upper-bound][. ge $lower-bound])
  return
    if ($count gt 0)
    then
      <search:facet-value name="bucket-{$upper-bound}"
        count="{$count}">{epf:format-day-time-duration($upper-bound)}</search:facet-value>
    else ()
The get-enrichment-durations() function does this:
cts:element-values(
  xs:QName("prof:overall-elapsed"), (), "descending",
  cts:collection-query($collection))
This works nicely and seems to produce correct numbers, except when the number
of durations within a particular pair of bounds exceeds 10,000. At that point
count() returns exactly 10,000, which is not credible: the chance of exactly
10,000 instances falling within a given range is essentially zero, and I'm
getting 10,000 for two different ranges.
Here are the results I get from running this in the query console:
<result>
  <count-durations>75778</count-durations>
  <facets>
    <search:facet-value name="bucket-PT0.01S" count="3"
      xmlns:search="http://marklogic.com/appservices/search">0.01 seconds</search:facet-value>
    <search:facet-value name="bucket-PT0.02S" count="7280"
      xmlns:search="http://marklogic.com/appservices/search">0.02 seconds</search:facet-value>
    <search:facet-value name="bucket-PT0.03S" count="10000"
      xmlns:search="http://marklogic.com/appservices/search">0.03 seconds</search:facet-value>
    <search:facet-value name="bucket-PT0.04S" count="10000"
      xmlns:search="http://marklogic.com/appservices/search">0.04 seconds</search:facet-value>
    <search:facet-value name="bucket-PT0.05S" count="9984"
      xmlns:search="http://marklogic.com/appservices/search">0.05 seconds</search:facet-value>
    …
  </facets>
</result>
There are 75,778 actual duration values, and the counts for the third and
fourth ranges are each exactly 10,000.
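One check that might narrow this down is whether 75,778 is the number of occurrences or only the number of distinct values, since cts:element-values() returns each distinct value from the range index once. The following is a rough sketch, not something I have verified: the collection name "profiles" is a placeholder, and it assumes that cts:element-reference() can resolve the existing range index on prof:overall-elapsed without extra options.

```xquery
xquery version "1.0-ml";

(: Hypothetical collection name; substitute the real one. :)
let $collection := "profiles"
let $query := cts:collection-query($collection)

(: Distinct values held in the range index for matching fragments. :)
let $distinct :=
  cts:element-values(
    xs:QName("prof:overall-elapsed"), (), "descending", $query)

(: Total number of occurrences, counting repeated values. :)
let $occurrences :=
  cts:count-aggregate(
    cts:element-reference(xs:QName("prof:overall-elapsed")),
    "item-frequency",
    $query)

return
  <check>
    <distinct-values>{count($distinct)}</distinct-values>
    <occurrences>{$occurrences}</occurrences>
  </check>
```

If the two numbers differ, the per-bucket counts above are counts of distinct durations rather than counts of profile runs.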
If I change the let $count := expression to only test the upper or lower bound
then I get numbers greater than 10,000. I also tried changing the order of the
predicates and using a single predicate with “and”. The problem only seems to
be related to using both predicates when the resulting sequence would have more
than 10K items.
Is there an explanation for why count() gives me exactly 10,000 in this case?
Is there a workaround for this behavior?
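A possible explanation, offered as a hypothesis rather than a confirmed diagnosis: cts:element-values() returns distinct values, so count() over its result counts distinct durations, not occurrences. If the elapsed times are stored at microsecond resolution (an assumption), a bucket 0.01 seconds wide can hold at most 10,000 distinct values, which would produce a ceiling of exactly 10,000. Below is a minimal sketch of a per-bucket count that weights each distinct value by cts:frequency(). The collection name and the two bounds are placeholders, and "item-frequency" is passed so that the frequency reflects occurrences rather than fragments.

```xquery
xquery version "1.0-ml";

(: Hypothetical inputs; in the real code these come from the bucket markup. :)
let $collection := "profiles"
let $lower-bound := xs:dayTimeDuration("PT0.02S")
let $upper-bound := xs:dayTimeDuration("PT0.03S")

(: Distinct values; each item carries a frequency readable via cts:frequency(). :)
let $values :=
  cts:element-values(
    xs:QName("prof:overall-elapsed"), (),
    ("descending", "item-frequency"),
    cts:collection-query($collection))

(: Sum the frequencies of the distinct values that fall inside the bucket. :)
return
  sum(
    for $d in $values
    where $d ge $lower-bound and $d lt $upper-bound
    return cts:frequency($d))
```

The sketch calls cts:frequency() directly on the items returned by the lexicon call. I have not checked whether the frequency survives being passed through a typed function return such as epf:get-enrichment-durations().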
The search range I’m constructing is normal ML-defined markup for defining a
search range, e.g.:
<search:range type="xs:dayTimeDuration" facet="true"
    xmlns:search="http://marklogic.com/appservices/search">
  <search:bucket ge="PT0.000S" lt="PT0.001S" name="100th">0.001 Second</search:bucket>
  <search:bucket ge="PT0.001S" lt="PT0.002S" name="200th">0.002 Second</search:bucket>
  <search:bucket ge="PT0.002S" lt="PT0.003S" name="300th">0.003 Second</search:bucket>
  <search:bucket ge="PT0.003S" lt="PT0.004S" name="400th">0.004 Second</search:bucket>
  <search:bucket ge="PT0.004S" lt="PT0.005S" name="500th">0.005 Second</search:bucket>
  …
</search:range>
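Since the bucket bounds are already in this markup, another option might be to let the range index do the bucketing with cts:element-value-ranges(). This is only a sketch under assumptions: the collection name is a placeholder, the inline search:range is a two-bucket stand-in for the real one, and "item-frequency" is requested so that cts:frequency() counts occurrences rather than fragments. That default may be related to the inaccurate numbers I saw with index-based buckets, but I have not confirmed it.

```xquery
xquery version "1.0-ml";
declare namespace search = "http://marklogic.com/appservices/search";

(: Hypothetical collection name and a shortened stand-in for the real range. :)
let $collection := "profiles"
let $search-range :=
  <search:range type="xs:dayTimeDuration" facet="true">
    <search:bucket ge="PT0.000S" lt="PT0.001S" name="100th">0.001 Second</search:bucket>
    <search:bucket ge="PT0.001S" lt="PT0.002S" name="200th">0.002 Second</search:bucket>
  </search:range>

(: Collect every distinct boundary from @ge and @lt, in ascending order. :)
let $bounds :=
  for $b in distinct-values(
    $search-range/search:bucket/(@ge, @lt)/xs:dayTimeDuration(.))
  order by $b
  return $b

(: One cts:range element per interval that contains values. :)
for $range in
  cts:element-value-ranges(
    xs:QName("prof:overall-elapsed"), $bounds,
    "item-frequency",
    cts:collection-query($collection))
return
  <range lower="{$range/cts:lower-bound}"
         upper="{$range/cts:upper-bound}"
         count="{cts:frequency($range)}"/>
```

Values below the first bound or at or above the last bound come back as open-ended ranges, so a missing lower or upper attribute in the output marks an overflow bucket.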
Thanks,
Eliot
--
Eliot Kimber
http://contrext.com