You should also take a look at cts:sum-aggregate() <https://docs.marklogic.com/cts:sum-aggregate>. Unlike fn:sum(), which pulls values from the D-nodes to the E-node where it’s being executed, cts:sum-aggregate() lets you scope the sum calculation by a query and delegate the aggregate calculations to the D-nodes in a cluster. The E-node then only needs to sum the partial aggregates. (It uses the underlying map-reduce framework available for UDFs <https://docs.marklogic.com/guide/app-dev/aggregateUDFs>.) This allows for more parallelization and less intermediate data being shuffled among nodes in a cluster. All of the aggregate functions in the cts namespace operate this way, although some aggregates are more parallelizable than others.
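
As a rough, untested sketch of what that could look like for one of your buckets: the namespace URI, the collection name, and the bucket bounds below are placeholders I made up, and I'm assuming a dayTimeDuration element range index on prof:overall-elapsed with exactly one such value per fragment (otherwise the two range queries could match different values in the same fragment). I believe the cts aggregates accept dayTimeDuration indexes, but please check the docs for your version.

```xquery
xquery version "1.0-ml";

(: Hypothetical namespace URI; substitute the real one bound to "prof". :)
declare namespace prof = "http://example.com/ns/profile";

(: Assumption: an element range index of type dayTimeDuration exists
   on prof:overall-elapsed. :)
let $ref := cts:element-reference(
  xs:QName("prof:overall-elapsed"), "type=dayTimeDuration")

(: Placeholder collection name and bucket bounds. :)
let $collection := "profiles"
let $lower := xs:dayTimeDuration("PT0.02S")
let $upper := xs:dayTimeDuration("PT0.03S")

(: Scope the aggregates to fragments in the collection whose value
   falls inside the bucket (assumes one value per fragment). :)
let $bucket-query :=
  cts:and-query((
    cts:collection-query($collection),
    cts:element-range-query(xs:QName("prof:overall-elapsed"), ">=", $lower),
    cts:element-range-query(xs:QName("prof:overall-elapsed"), "<", $upper)
  ))

return
  <bucket ge="{$lower}" lt="{$upper}">
    <!-- number of occurrences, computed on the D-nodes -->
    <count>{cts:count-aggregate($ref, "item-frequency", $bucket-query)}</count>
    <!-- total elapsed time in the bucket, also computed on the D-nodes -->
    <total>{cts:sum-aggregate($ref, "item-frequency", $bucket-query)}</total>
  </bucket>
```

The point of passing "item-frequency" here is the same as in your cts:element-values() call: each occurrence is counted, not each fragment.
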
Justin

> On Aug 23, 2017, at 7:34 AM, Eliot Kimber <[email protected]> wrote:
>
> Yes, I added “item-frequency” to my cts:element-values() call and now all the
> numbers appear to be correct.
>
> I haven’t circled back to my original issue with the ML-provided search
> buckets not being the right size but if I have time I’ll see if the issue was
> failing to specify item-frequency at some point.
>
> Cheers,
>
> E.
> -----
> On 8/23/17, 2:37 AM, "[email protected] on behalf of
> Geert Josten" <[email protected] on behalf of
> [email protected]> wrote:
>
> Hi Eliot,
>
> Keep in mind that you pass in item-frequency in cts:element-values, but
> the default for range constraints is likely fragment-frequency. Did you
> pass in an item-frequency facet-option in there too?
>
> Kind regards,
> Geert
>
> On 8/22/17, 10:47 PM, "[email protected] on behalf
> of Eliot Kimber" <[email protected] on behalf of
> [email protected]> wrote:
>
>> If I sum the counts of each bucket calculated using cts:frequency() it
>> matches the total calculated using the initial result from the
>> element-values() query, so I guess the 10,000 count is a side effect of
>> some internal lexicon implementation magic.
>>
>> Cheers,
>>
>> E.
>>
>> --
>> Eliot Kimber
>> http://contrext.com
>>
>> On 8/22/17, 3:25 PM, "[email protected] on behalf
>> of Eliot Kimber" <[email protected] on behalf of
>> [email protected]> wrote:
>>
>> I think this is again my weak understanding of lexicons and frequency
>> counting.
>>
>> If I change my code to sum the frequencies of the durations in each
>> range then I get more sensible numbers, e.g.:
>>
>>     let $count := sum(for $dur in $durations[. lt $upper-bound][. ge $lower-bound] return cts:frequency($dur))
>>
>> Having updated get-enrichment-durations() to:
>>
>>     cts:element-values(xs:QName("prof:overall-elapsed"), (),
>>         ("descending", "item-frequency"),
>>         cts:collection-query($collection))
>>
>> It still seems odd that the pure lexicon check returns exactly 10,000
>> *values*--that still seems suspect, but then using those 10,000 values to
>> calculate the total frequency does result in a more likely number. I
>> guess I can do some brute-force querying to see if it’s accurate.
>>
>> Cheers,
>>
>> Eliot
>> --
>> Eliot Kimber
>> http://contrext.com
>>
>> On 8/22/17, 2:52 PM, "[email protected] on
>> behalf of Eliot Kimber" <[email protected] on
>> behalf of [email protected]> wrote:
>>
>> Using ML 8.0-3.2
>>
>> As part of my profiling application I run a large number of
>> profiles, storing the profiler results back to the database. I’m then
>> extracting the times from the profiling data to create histograms and do
>> other analysis.
>>
>> My first attempt to do this with buckets ran into the problem
>> that the index-based buckets were not returning accurate numbers, so I
>> reimplemented it to construct the buckets manually from a list of the
>> actual duration values.
>>
>> My code is:
>>
>>     let $durations as xs:dayTimeDuration* :=
>>         epf:get-enrichment-durations($collection)
>>     let $search-range := epf:construct-search-range()
>>     let $facets :=
>>         for $bucket in $search-range/search:bucket
>>         let $upper-bound := if ($bucket/@lt) then xs:dayTimeDuration($bucket/@lt) else xs:dayTimeDuration("PT0S")
>>         let $lower-bound := xs:dayTimeDuration($bucket/@ge)
>>         let $count := count($durations[. lt $upper-bound][. ge $lower-bound])
>>         return if ($count gt 0)
>>             then <search:facet-value name="bucket-{$upper-bound}" count="{$count}">{epf:format-day-time-duration($upper-bound)}</search:facet-value>
>>             else ()
>>
>> The get-enrichment-durations() function does this:
>>
>>     cts:element-values(xs:QName("prof:overall-elapsed"), (),
>>         "descending",
>>         cts:collection-query($collection))
>>
>> This works nicely and seems to provide correct numbers except
>> when the number of durations within a particular set of bounds exceeds
>> 10,000, at which point count() returns 10,000, which is an impossible
>> number--the chance of there being exactly 10,000 instances within a given
>> range is basically zero. But I’m getting 10,000 twice, which is
>> absolutely impossible.
>>
>> Here’s the results I get from running this in the query console:
>>
>>     <result>
>>       <count-durations>75778</count-durations>
>>       <facets>
>>         <search:facet-value name="bucket-PT0.01S" count="3"
>>             xmlns:search="http://marklogic.com/appservices/search">0.01 seconds</search:facet-value>
>>         <search:facet-value name="bucket-PT0.02S" count="7280"
>>             xmlns:search="http://marklogic.com/appservices/search">0.02 seconds</search:facet-value>
>>         <search:facet-value name="bucket-PT0.03S" count="10000"
>>             xmlns:search="http://marklogic.com/appservices/search">0.03 seconds</search:facet-value>
>>         <search:facet-value name="bucket-PT0.04S" count="10000"
>>             xmlns:search="http://marklogic.com/appservices/search">0.04 seconds</search:facet-value>
>>         <search:facet-value name="bucket-PT0.05S" count="9984"
>>             xmlns:search="http://marklogic.com/appservices/search">0.05 seconds</search:facet-value>
>>         …
>>       </facets>
>>     </result>
>>
>> There are 75,778 actual duration values and the count value for
>> the 3rd and 4th ranges are exactly 10,000.
>>
>> If I change the let $count := expression to only test the upper
>> or lower bound then I get numbers greater than 10,000. I also tried
>> changing the order of the predicates and using a single predicate with
>> “and”. The problem only seems to be related to using both predicates when
>> the resulting sequence would have more than 10K items.
>>
>> Is there an explanation for why count() gives me exactly 10,000
>> in this case?
>>
>> Is there a workaround for this behavior?
>>
>> The search range I’m constructing is normal ML-defined markup for
>> defining a search range, e.g.:
>>
>>     <search:range type="xs:dayTimeDuration" facet="true"
>>         xmlns:search="http://marklogic.com/appservices/search">
>>       <search:bucket ge="PT0.000S" lt="PT0.001S" name="100th">0.001 Second</search:bucket>
>>       <search:bucket ge="PT0.001S" lt="PT0.002S" name="200th">0.002 Second</search:bucket>
>>       <search:bucket ge="PT0.002S" lt="PT0.003S" name="300th">0.003 Second</search:bucket>
>>       <search:bucket ge="PT0.003S" lt="PT0.004S" name="400th">0.004 Second</search:bucket>
>>       <search:bucket ge="PT0.004S" lt="PT0.005S" name="500th">0.005 Second</search:bucket>
>>       …
>>     </search:range>
>>
>> Thanks,
>>
>> Eliot
>> --
>> Eliot Kimber
>> http://contrext.com
>>
>> _______________________________________________
>> General mailing list
>> [email protected]
>> Manage your subscription at:
>> http://developer.marklogic.com/mailman/listinfo/general
