If I sum the counts for each bucket, calculated using cts:frequency(), the result matches the total calculated from the initial result of the cts:element-values() query, so I guess the 10,000 count is a side effect of some internal lexicon implementation magic.
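
The difference between counting lexicon values and summing their frequencies can be sketched outside MarkLogic. The snippet below is a plain Python model, not MarkLogic code: `bucket_counts` and the sample data are hypothetical, and it assumes the durations are recorded at microsecond resolution, which the thread does not state. Under that assumption a bucket 0.01 seconds wide can hold at most 10,000 distinct values, which is one possible reason a count of distinct values would stop at exactly 10,000 while the summed frequencies keep growing.

```python
from collections import Counter


def bucket_counts(samples_us, lower_us, upper_us):
    """Return (distinct values, total occurrences) for lower_us <= v < upper_us.

    samples_us holds durations as integer microseconds (an assumption made
    for this sketch). The Counter stands in for a value lexicon: each
    distinct value is stored once, together with how often it occurs.
    """
    lexicon = Counter(samples_us)
    in_range = [v for v in lexicon if lower_us <= v < upper_us]
    distinct = len(in_range)                   # like count() over the lexicon values
    total = sum(lexicon[v] for v in in_range)  # like summing a frequency per value
    return distinct, total


# Hypothetical data: every microsecond value in [0.02 s, 0.03 s) occurs twice,
# plus a single value that falls outside the bucket.
samples_us = list(range(20_000, 30_000)) * 2 + [31_000]

distinct, total = bucket_counts(samples_us, 20_000, 30_000)
print(distinct, total)
```

Here the distinct count is capped by the number of representable values in the bucket, while the total reflects every occurrence.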
Cheers,

E.

--
Eliot Kimber
http://contrext.com

On 8/22/17, 3:25 PM, "[email protected] on behalf of Eliot Kimber" <[email protected] on behalf of [email protected]> wrote:

I think this is again my weak understanding of lexicons and frequency counting.

If I change my code to sum the frequencies of the durations in each range then I get more sensible numbers, e.g.:

    let $count := sum(for $dur in $durations[. lt $upper-bound][. ge $lower-bound] return cts:frequency($dur))

This is after updating get-enrichment-durations() to:

    cts:element-values(xs:QName("prof:overall-elapsed"), (), ("descending", "item-frequency"), cts:collection-query($collection))

It still seems odd that the pure lexicon check returns exactly 10,000 *values*. That still seems suspect, but using those 10,000 values to calculate the total frequency does result in a more likely number.

I guess I can do some brute-force querying to see if it’s accurate.

Cheers,

Eliot

--
Eliot Kimber
http://contrext.com

On 8/22/17, 2:52 PM, "[email protected] on behalf of Eliot Kimber" <[email protected] on behalf of [email protected]> wrote:

Using ML 8.0-3.2

As part of my profiling application I run a large number of profiles, storing the profiler results back to the database. I’m then extracting the times from the profiling data to create histograms and do other analysis.

My first attempt to do this with buckets ran into the problem that the index-based buckets were not returning accurate numbers, so I reimplemented it to construct the buckets manually from a list of the actual duration values.

My code is:

    let $durations as xs:dayTimeDuration* := epf:get-enrichment-durations($collection)
    let $search-range := epf:construct-search-range()
    let $facets :=
      for $bucket in $search-range/search:bucket
      let $upper-bound := if ($bucket/@lt) then xs:dayTimeDuration($bucket/@lt) else xs:dayTimeDuration("PT0S")
      let $lower-bound := xs:dayTimeDuration($bucket/@ge)
      let $count := count($durations[. lt $upper-bound][. ge $lower-bound])
      return
        if ($count gt 0)
        then <search:facet-value name="bucket-{$upper-bound}" count="{$count}">{epf:format-day-time-duration($upper-bound)}</search:facet-value>
        else ()

The get-enrichment-durations() function does this:

    cts:element-values(xs:QName("prof:overall-elapsed"), (), "descending", cts:collection-query($collection))

This works nicely and seems to provide correct numbers except when the number of durations within a particular set of bounds exceeds 10,000, at which point count() returns 10,000. That is an implausible number: the chance of there being exactly 10,000 instances within a given range is basically zero. But I’m getting 10,000 twice, which is absolutely impossible.

Here are the results I get from running this in the query console:

    <result>
      <count-durations>75778</count-durations>
      <facets>
        <search:facet-value name="bucket-PT0.01S" count="3" xmlns:search="http://marklogic.com/appservices/search">0.01 seconds</search:facet-value>
        <search:facet-value name="bucket-PT0.02S" count="7280" xmlns:search="http://marklogic.com/appservices/search">0.02 seconds</search:facet-value>
        <search:facet-value name="bucket-PT0.03S" count="10000" xmlns:search="http://marklogic.com/appservices/search">0.03 seconds</search:facet-value>
        <search:facet-value name="bucket-PT0.04S" count="10000" xmlns:search="http://marklogic.com/appservices/search">0.04 seconds</search:facet-value>
        <search:facet-value name="bucket-PT0.05S" count="9984" xmlns:search="http://marklogic.com/appservices/search">0.05 seconds</search:facet-value>
        …
      </facets>
    </result>

There are 75,778 actual duration values, and the count values for the 3rd and 4th ranges are exactly 10,000.

If I change the let $count := expression to test only the upper or only the lower bound then I get numbers greater than 10,000.

I also tried changing the order of the predicates and using a single predicate with “and”. The problem only seems to occur when both bounds are tested and the resulting sequence would have more than 10K items.

Is there an explanation for why count() gives me exactly 10,000 in this case?

Is there a workaround for this behavior?

The search range I’m constructing is normal ML-defined markup for defining a search range, e.g.:

    <search:range type="xs:dayTimeDuration" facet="true" xmlns:search="http://marklogic.com/appservices/search">
      <search:bucket ge="PT0.000S" lt="PT0.001S" name="100th">0.001 Second</search:bucket>
      <search:bucket ge="PT0.001S" lt="PT0.002S" name="200th">0.002 Second</search:bucket>
      <search:bucket ge="PT0.002S" lt="PT0.003S" name="300th">0.003 Second</search:bucket>
      <search:bucket ge="PT0.003S" lt="PT0.004S" name="400th">0.004 Second</search:bucket>
      <search:bucket ge="PT0.004S" lt="PT0.005S" name="500th">0.005 Second</search:bucket>
      …
    </search:range>

Thanks,

Eliot

--
Eliot Kimber
http://contrext.com

_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
