Re: [MarkLogic Dev General] Getting Impossible Value from count()--why?

Justin Makeig Wed, 23 Aug 2017 09:23:08 -0700

You should also take a look at cts:sum-aggregate() 
<https://docs.marklogic.com/cts:sum-aggregate>. Unlike fn:sum(), which pulls 
values from the D-nodes to the E-node where it’s being executed, 
cts:sum-aggregate() allows you to scope the sum calculation by a query and 
delegate the aggregate calculations to the D-nodes in a cluster. The E-node 
needs only sum the partial aggregates. (It uses the underlying map-reduce 
framework available for UDFs 
<https://docs.marklogic.com/guide/app-dev/aggregateUDFs>.) This allows for more 
parallelization and less intermediate data being shuffled among nodes in a 
cluster. All of the aggregate functions in the cts namespace operate this way, 
although some aggregates are more parallelizable than others.
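
As a rough, untested sketch of that approach applied to the bucket counts 
discussed in this thread, the count and the total elapsed time for a single 
bucket might be computed as below. It assumes an xs:dayTimeDuration element 
range index on prof:overall-elapsed; the collection name, the bucket bounds, 
and the namespace binding are placeholders for illustration. Note that the 
query selects fragments, so if one fragment can hold several 
prof:overall-elapsed values, values outside the bounds would be included too.

```xquery
xquery version "1.0-ml";

(: Assumption: namespace URI for the prof prefix used in this thread. :)
declare namespace prof = "http://marklogic.com/xdmp/profile";

(: Hypothetical inputs, for illustration only. :)
let $collection  := "profiles"
let $lower-bound := xs:dayTimeDuration("PT0.02S")
let $upper-bound := xs:dayTimeDuration("PT0.03S")

(: Reference to the range index that backs the lexicon. :)
let $ref := cts:element-reference(xs:QName("prof:overall-elapsed"))

(: Scope the aggregate to fragments in the collection whose value
   falls inside the bucket [lower-bound, upper-bound). :)
let $query :=
  cts:and-query((
    cts:collection-query($collection),
    cts:element-range-query(xs:QName("prof:overall-elapsed"), ">=", $lower-bound),
    cts:element-range-query(xs:QName("prof:overall-elapsed"), "<", $upper-bound)
  ))

return (
  (: Number of value occurrences, not number of distinct values. :)
  cts:count-aggregate($ref, "item-frequency", $query),
  (: Sum of those occurrences, computed from the index. :)
  cts:sum-aggregate($ref, "item-frequency", $query)
)
```

Both calls return a single value computed from the index, so no sequence of 
durations has to be pulled back to the E-node and filtered there.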


Justin


> On Aug 23, 2017, at 7:34 AM, Eliot Kimber <[email protected]> wrote:
> 
> Yes, I added “item-frequency” to my cts:element-values() call and now all the 
> numbers appear to be correct.
> 
> I haven’t circled back to my original issue with the ML-provided search 
> buckets not being the right size but if I have time I’ll see if the issue was 
> failing to specify item-frequency at some point.
> 
> Cheers,
> 
> E.
> -----
> On 8/23/17, 2:37 AM, "[email protected] on behalf of 
> Geert Josten" <[email protected] on behalf of 
> [email protected]> wrote:
> 
>    Hi Eliot,
> 
>    Keep in mind that you pass in item-frequency in cts:element-values, but
>    the default for range constraints is likely fragment-frequency. Did you
>    pass in an item-frequency facet-option in there too?
> 
>    Kind regards,
>    Geert
> 
>    On 8/22/17, 10:47 PM, "[email protected] on behalf
>    of Eliot Kimber" <[email protected] on behalf of
>    [email protected]> wrote:
> 
>> If I sum the counts of each bucket calculated using cts:frequency() it
>> matches the total calculated using the initial result from the
>> element-values() query, so I guess the 10,000 count is a side effect of
>> some internal lexicon implementation magic.
>> 
>> Cheers,
>> 
>> E.
>> 
>> --
>> Eliot Kimber
>> http://contrext.com
>> 
>> 
>> 
>> On 8/22/17, 3:25 PM, "[email protected] on behalf
>> of Eliot Kimber" <[email protected] on behalf of
>> [email protected]> wrote:
>> 
>>   I think this is again my weak understanding of lexicons and frequency
>> counting. 
>> 
>>   If I change my code to sum the frequencies of the durations in each
>> range then I get more sensible numbers, e.g.:
>> 
>>   let $count := sum(for $dur in $durations[. lt $upper-bound][. ge
>> $lower-bound] return cts:frequency($dur))
>> 
>>   Having updated get-enrichment-durations() to:
>> 
>>   cts:element-values(xs:QName("prof:overall-elapsed"), (),
>> ("descending", "item-frequency"),
>>                            cts:collection-query($collection))
>> 
>>   It still seems odd that the pure lexicon check returns exactly 10,000
>> *values*--that still seems suspect, but then using those 10,000 values to
>> calculate the total frequency does result in a more likely number. I
>> guess I can do some brute-force querying to see if it's accurate.
>> 
>>   Cheers,
>> 
>>   Eliot
>>   --
>>   Eliot Kimber
>>   http://contrext.com
>> 
>> 
>> 
>>   On 8/22/17, 2:52 PM, "[email protected] on
>> behalf of Eliot Kimber" <[email protected] on
>> behalf of [email protected]> wrote:
>> 
>>       Using ML 8.0-3.2
>> 
>>       As part of my profiling application I run a large number of
>> profiles, storing the profiler results back to the database. I'm then
>> extracting the times from the profiling data to create histograms and do
>> other analysis.
>> 
>>       My first attempt to do this with buckets ran into the problem
>> that the index-based buckets were not returning accurate numbers, so I
>> reimplemented it to construct the buckets manually from a list of the
>> actual duration values.
>> 
>>       My code is:
>> 
>>       let $durations as xs:dayTimeDuration* :=
>> epf:get-enrichment-durations($collection)
>>       let $search-range := epf:construct-search-range()
>>       let $facets :=
>>           for $bucket in $search-range/search:bucket
>>           let $upper-bound := if ($bucket/@lt) then
>> xs:dayTimeDuration($bucket/@lt) else xs:dayTimeDuration("PT0S")
>>           let $lower-bound := xs:dayTimeDuration($bucket/@ge)
>>           let $count := count($durations[. lt $upper-bound][. ge
>> $lower-bound]) 
>>           return if ($count gt 0)
>>                  then <search:facet-value name="bucket-{$upper-bound}"
>> count="{$count}">{epf:format-day-time-duration($upper-bound)}</search:facet-value>
>>                  else ()
>> 
>>       The get-enrichment-durations() function does this:
>> 
>>         cts:element-values(xs:QName("prof:overall-elapsed"), (),
>> "descending",
>>                            cts:collection-query($collection))
>> 
>>       This works nicely and seems to provide correct numbers except
>> when the number of durations within a particular set of bounds exceeds
>> 10,000, at which point count() returns 10,000, which is an impossible
>> number: the chance of there being exactly 10,000 instances within a given
>> range is basically zero. But I'm getting 10,000 twice, which is
>> absolutely impossible.
>> 
>>       Here's the results I get from running this in the query console:
>> 
>>       <result>
>>       <count-durations>75778</count-durations>
>>       <facets>
>>       <search:facet-value name="bucket-PT0.01S" count="3"
>> xmlns:search="http://marklogic.com/appservices/search">0.01
>> seconds</search:facet-value>
>>       <search:facet-value name="bucket-PT0.02S" count="7280"
>> xmlns:search="http://marklogic.com/appservices/search">0.02
>> seconds</search:facet-value>
>>       <search:facet-value name="bucket-PT0.03S" count="10000"
>> xmlns:search="http://marklogic.com/appservices/search">0.03
>> seconds</search:facet-value>
>>       <search:facet-value name="bucket-PT0.04S" count="10000"
>> xmlns:search="http://marklogic.com/appservices/search">0.04
>> seconds</search:facet-value>
>>       <search:facet-value name="bucket-PT0.05S" count="9984"
>> xmlns:search="http://marklogic.com/appservices/search">0.05
>> seconds</search:facet-value>
>>        ...
>>       </facets>
>>       </result>
>> 
>>       There are 75,778 actual duration values and the count value for
>> the 3rd and 4th ranges are exactly 10,000.
>> 
>>       If I change the let $count := expression to only test the upper
>> or lower bound then I get numbers greater than 10,000. I also tried
>> changing the order of the predicates and using a single predicate with
>> "and". The problem only seems to be related to using both predicates when
>> the resulting sequence would have more than 10K items.
>> 
>>       Is there an explanation for why count() gives me exactly 10,000
>> in this case?
>> 
>>       Is there a workaround for this behavior?
>> 
>>       The search range I'm constructing is normal ML-defined markup for
>> defining a search range, e.g.:
>> 
>>       <search:range type="xs:dayTimeDuration" facet="true"
>> xmlns:search="http://marklogic.com/appservices/search">
>>       <search:bucket ge="PT0.000S" lt="PT0.001S" name="100th">0.001
>> Second</search:bucket>
>>       <search:bucket ge="PT0.001S" lt="PT0.002S" name="200th">0.002
>> Second</search:bucket>
>>       <search:bucket ge="PT0.002S" lt="PT0.003S" name="300th">0.003
>> Second</search:bucket>
>>       <search:bucket ge="PT0.003S" lt="PT0.004S" name="400th">0.004
>> Second</search:bucket>
>>       <search:bucket ge="PT0.004S" lt="PT0.005S" name="500th">0.005
>> Second</search:bucket>
>>       ...
>>       </search:range>
>> 
>>       Thanks,
>> 
>>       Eliot
>>       --
>>       Eliot Kimber
>>       http://contrext.com
>> 
>> 
>> 
>> 
>>       _______________________________________________
>>       General mailing list
>>       [email protected]
>>       Manage your subscription at:
>>       http://developer.marklogic.com/mailman/listinfo/general
>> 
>> 
>> 
> 


