I think this is again my weak understanding of lexicons and frequency counting.
If I change my code to sum the frequencies of the durations in each range then
I get more sensible numbers, e.g.:
    let $count := sum(for $dur in $durations[. lt $upper-bound][. ge $lower-bound]
                      return cts:frequency($dur))
That is after updating get-enrichment-durations() to request item frequencies:
    cts:element-values(xs:QName("prof:overall-elapsed"), (),
                       ("descending", "item-frequency"),
                       cts:collection-query($collection))
It still seems odd that the pure lexicon check returns exactly 10,000
*values*, but using those 10,000 values to calculate the total frequency
does result in a more plausible number. I guess I can do some brute-force
querying to see if it’s accurate.
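For reference, here is a minimal sketch of the two counts side by side for a single bucket. It is only an illustration under stated assumptions: the collection name and the bounds are made-up placeholders, and the prof prefix is assumed to be bound as in my module. The point is that cts:element-values() returns each distinct value once, so count() counts distinct values, while summing cts:frequency() with "item-frequency" counts occurrences. If the elapsed times are stored at microsecond precision (an assumption I have not checked), a 0.01-second bucket can hold at most 10,000 distinct values, which would be consistent with count() topping out at exactly 10,000.

```xquery
xquery version "1.0-ml";

(: Hypothetical inputs for illustration only :)
let $collection := "my-profile-results"
let $lower-bound := xs:dayTimeDuration("PT0.02S")
let $upper-bound := xs:dayTimeDuration("PT0.03S")

(: Distinct values from the range index; "item-frequency" makes
   cts:frequency() report occurrences rather than fragments.
   Assumes the prof prefix is bound as in the original module. :)
let $durations :=
  cts:element-values(xs:QName("prof:overall-elapsed"), (),
                     ("descending", "item-frequency"),
                     cts:collection-query($collection))

(: Values that fall inside the bucket :)
let $in-bucket := $durations[. ge $lower-bound][. lt $upper-bound]

(: Number of distinct values in the bucket :)
let $distinct-count := count($in-bucket)

(: Number of occurrences in the bucket :)
let $total-count := sum(for $dur in $in-bucket return cts:frequency($dur))

return
  <bucket-check distinct-values="{$distinct-count}"
                occurrences="{$total-count}"/>
```

If the two attributes differ for a bucket, the count() figure is a distinct-value count rather than a count of stored durations.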
Cheers,
Eliot
--
Eliot Kimber
http://contrext.com
On 8/22/17, 2:52 PM, Eliot Kimber wrote:
Using ML 8.0-3.2
As part of my profiling application I run a large number of profiles,
storing the profiler results back to the database. I’m then extracting the
times from the profiling data to create histograms and do other analysis.
My first attempt to do this with buckets ran into the problem that the
index-based buckets were not returning accurate numbers, so I reimplemented it
to construct the buckets manually from a list of the actual duration values.
My code is:
    let $durations as xs:dayTimeDuration* :=
        epf:get-enrichment-durations($collection)
    let $search-range := epf:construct-search-range()
    let $facets :=
        for $bucket in $search-range/search:bucket
        let $upper-bound :=
            if ($bucket/@lt)
            then xs:dayTimeDuration($bucket/@lt)
            else xs:dayTimeDuration("PT0S")
        let $lower-bound := xs:dayTimeDuration($bucket/@ge)
        let $count := count($durations[. lt $upper-bound][. ge $lower-bound])
        return
            if ($count gt 0)
            then <search:facet-value name="bucket-{$upper-bound}"
                     count="{$count}">{epf:format-day-time-duration($upper-bound)}</search:facet-value>
            else ()
The get-enrichment-durations() function does this:
    cts:element-values(xs:QName("prof:overall-elapsed"), (), "descending",
                       cts:collection-query($collection))
This works nicely and seems to provide correct numbers except when the
number of durations within a particular set of bounds exceeds 10,000, at which
point count() returns 10,000, which is an impossible number—the chance of there
being exactly 10,000 instances within a given range is basically zero. But I’m
getting 10,000 twice, which is absolutely impossible.
Here’s the results I get from running this in the query console:
    <result>
      <count-durations>75778</count-durations>
      <facets>
        <search:facet-value name="bucket-PT0.01S" count="3"
            xmlns:search="http://marklogic.com/appservices/search">0.01 seconds</search:facet-value>
        <search:facet-value name="bucket-PT0.02S" count="7280"
            xmlns:search="http://marklogic.com/appservices/search">0.02 seconds</search:facet-value>
        <search:facet-value name="bucket-PT0.03S" count="10000"
            xmlns:search="http://marklogic.com/appservices/search">0.03 seconds</search:facet-value>
        <search:facet-value name="bucket-PT0.04S" count="10000"
            xmlns:search="http://marklogic.com/appservices/search">0.04 seconds</search:facet-value>
        <search:facet-value name="bucket-PT0.05S" count="9984"
            xmlns:search="http://marklogic.com/appservices/search">0.05 seconds</search:facet-value>
        …
      </facets>
    </result>
There are 75,778 actual duration values, and the counts for the 3rd and
4th ranges are both exactly 10,000.
If I change the let $count := expression to only test the upper or lower
bound then I get numbers greater than 10,000. I also tried changing the order
of the predicates and using a single predicate with “and”. The problem only
seems to be related to using both predicates when the resulting sequence would
have more than 10K items.
Is there an explanation for why count() gives me exactly 10,000 in this
case?
Is there a workaround for this behavior?
The search range I’m constructing is normal ML-defined markup for defining
a search range, e.g.:
    <search:range type="xs:dayTimeDuration" facet="true"
        xmlns:search="http://marklogic.com/appservices/search">
      <search:bucket ge="PT0.000S" lt="PT0.001S" name="100th">0.001 Second</search:bucket>
      <search:bucket ge="PT0.001S" lt="PT0.002S" name="200th">0.002 Second</search:bucket>
      <search:bucket ge="PT0.002S" lt="PT0.003S" name="300th">0.003 Second</search:bucket>
      <search:bucket ge="PT0.003S" lt="PT0.004S" name="400th">0.004 Second</search:bucket>
      <search:bucket ge="PT0.004S" lt="PT0.005S" name="500th">0.005 Second</search:bucket>
      …
    </search:range>
Thanks,
Eliot
--
Eliot Kimber
http://contrext.com
_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general