If I sum the counts of each bucket, calculated using cts:frequency(), the
result matches the total calculated from the initial result of the
cts:element-values() query, so I guess the 10,000 count is a side effect of
some internal lexicon implementation magic.
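
To make the distinction concrete, here is a minimal sketch (not the actual profiling code) contrasting the two counts. A lexicon call returns each distinct value once, so count() over its result counts distinct values, while summing cts:frequency() over the same result counts occurrences. The collection name "profiles" is hypothetical, the prof namespace URI is assumed to be the standard MarkLogic profiler namespace, and the query assumes a range index on prof:overall-elapsed, as in the code quoted below.

```xquery
xquery version "1.0-ml";

(: Assumption: the standard MarkLogic profiler namespace. :)
declare namespace prof = "http://marklogic.com/xdmp/profile";

(: Hypothetical collection name, for illustration only. :)
declare variable $collection as xs:string := "profiles";

(: Each distinct duration appears once in the lexicon result;
   "item-frequency" makes cts:frequency() report occurrences
   rather than the number of fragments containing the value. :)
let $durations :=
    cts:element-values(xs:QName("prof:overall-elapsed"), (),
                       ("descending", "item-frequency"),
                       cts:collection-query($collection))
return
    <counts>
        <distinct-values>{count($durations)}</distinct-values>
        <occurrences>{
            sum(for $dur in $durations return cts:frequency($dur))
        }</occurrences>
    </counts>
```

If the two numbers differ, the lexicon holds repeated values, and any bucket count built with count() alone will undercount.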

Cheers,

E.

--
Eliot Kimber
http://contrext.com
 


On 8/22/17, 3:25 PM, Eliot Kimber wrote:

    I think this is again my weak understanding of lexicons and frequency
    counting.

    If I change my code to sum the frequencies of the durations in each range
    then I get more sensible numbers, e.g.:

    let $count := sum(for $dur in $durations[. lt $upper-bound][. ge $lower-bound]
                      return cts:frequency($dur))

    Having updated get-enrichment-durations() to:

    cts:element-values(xs:QName("prof:overall-elapsed"), (),
                       ("descending", "item-frequency"),
                       cts:collection-query($collection))

    It still seems odd that the pure lexicon check returns exactly 10,000
    *values*--that still seems suspect, but then using those 10,000 values to
    calculate the total frequency does result in a more likely number. I guess
    I can do some brute-force querying to see if it’s accurate.
    
    Cheers,
    
    Eliot
    --
    Eliot Kimber
    http://contrext.com
     
    
    
    On 8/22/17, 2:52 PM, Eliot Kimber wrote:
    
        Using ML 8.0-3.2
        
        As part of my profiling application I run a large number of profiles,
        storing the profiler results back to the database. I’m then extracting
        the times from the profiling data to create histograms and do other
        analysis.

        My first attempt to do this with buckets ran into the problem that the
        index-based buckets were not returning accurate numbers, so I
        reimplemented it to construct the buckets manually from a list of the
        actual duration values.
        
        My code is:
        
        let $durations as xs:dayTimeDuration* :=
            epf:get-enrichment-durations($collection)
        let $search-range := epf:construct-search-range()
        let $facets :=
            for $bucket in $search-range/search:bucket
            let $upper-bound :=
                if ($bucket/@lt)
                then xs:dayTimeDuration($bucket/@lt)
                else xs:dayTimeDuration("PT0S")
            let $lower-bound := xs:dayTimeDuration($bucket/@ge)
            let $count := count($durations[. lt $upper-bound][. ge $lower-bound])
            return
                if ($count gt 0)
                then <search:facet-value name="bucket-{$upper-bound}"
                         count="{$count}">{epf:format-day-time-duration($upper-bound)}</search:facet-value>
                else ()
        
        The get-enrichment-durations() function does this:
        
          cts:element-values(xs:QName("prof:overall-elapsed"), (), "descending",
                             cts:collection-query($collection))
        
        This works nicely and seems to provide correct numbers except when the
        number of durations within a particular set of bounds exceeds 10,000,
        at which point count() returns 10,000, which is an impossible number:
        the chance of there being exactly 10,000 instances within a given range
        is basically zero. But I’m getting 10,000 twice, which is absolutely
        impossible.
        
        Here’s the results I get from running this in the query console:
        
        <result>
        <count-durations>75778</count-durations>
        <facets>
        <search:facet-value name="bucket-PT0.01S" count="3"
            xmlns:search="http://marklogic.com/appservices/search">0.01 seconds</search:facet-value>
        <search:facet-value name="bucket-PT0.02S" count="7280"
            xmlns:search="http://marklogic.com/appservices/search">0.02 seconds</search:facet-value>
        <search:facet-value name="bucket-PT0.03S" count="10000"
            xmlns:search="http://marklogic.com/appservices/search">0.03 seconds</search:facet-value>
        <search:facet-value name="bucket-PT0.04S" count="10000"
            xmlns:search="http://marklogic.com/appservices/search">0.04 seconds</search:facet-value>
        <search:facet-value name="bucket-PT0.05S" count="9984"
            xmlns:search="http://marklogic.com/appservices/search">0.05 seconds</search:facet-value>
         …
        </facets>
        </result>
        
        There are 75,778 actual duration values, and the count values for the
        3rd and 4th ranges are both exactly 10,000.

        If I change the let $count := expression to test only the upper or only
        the lower bound then I get numbers greater than 10,000. I also tried
        changing the order of the predicates and using a single predicate with
        “and”. The problem only seems to occur when both predicates are used
        and the resulting sequence would have more than 10K items.

        Is there an explanation for why count() gives me exactly 10,000 in this
        case?
        
        Is there a workaround for this behavior?
        
        The search range I’m constructing is normal ML-defined markup for
        defining a search range, e.g.:

        <search:range type="xs:dayTimeDuration" facet="true"
            xmlns:search="http://marklogic.com/appservices/search">
        <search:bucket ge="PT0.000S" lt="PT0.001S" name="100th">0.001 Second</search:bucket>
        <search:bucket ge="PT0.001S" lt="PT0.002S" name="200th">0.002 Second</search:bucket>
        <search:bucket ge="PT0.002S" lt="PT0.003S" name="300th">0.003 Second</search:bucket>
        <search:bucket ge="PT0.003S" lt="PT0.004S" name="400th">0.004 Second</search:bucket>
        <search:bucket ge="PT0.004S" lt="PT0.005S" name="500th">0.005 Second</search:bucket>
        …
        </search:range>
        
        Thanks,
        
        Eliot
        --
        Eliot Kimber
        http://contrext.com
         
        
        
        
        _______________________________________________
        General mailing list
        [email protected]
        Manage your subscription at: 
        http://developer.marklogic.com/mailman/listinfo/general
        
    
    
    

