Thanks Geert! I will try this tomorrow!
Regards,
Johan

On Mon, Jun 29, 2015 at 3:25 PM Geert Josten <[email protected]> wrote:

> I ran some profiles as well, I think your profile results could have
> been a bit misleading. I suspect walking over uris, and summing frequencies
> is one of the major slow parts. I managed to create a UDF though (my first
> one). It is relatively generic, and should allow summing frequencies of any
> elem/attrib. You can download it from here:
>
> https://github.com/grtjn/doc-count-udf
>
> Get/clone it, run `make` (pref on the target env), and follow
> instructions to install it. After that you can run:
>
> let $uris := cts:aggregate(
>   "gjosten/doc-count",
>   "doc-count",
>   (
>     cts:uri-reference(),
>     cts:element-attribute-reference(xs:QName("file"), xs:QName("size"))
>   )
> )
> let $counts := -$uris
> let $top-keys :=
>   for $key in map:keys($counts)
>   order by xs:int($key) descending
>   return $key
> return (
>   for $key in $top-keys
>   for $value in map:get($counts, $key)
>   return $value || " - " || $key
> )[1 to 10]
>
> I tested with 1k docs, and my earlier tuples approach took 14 sec with
> that, less than 1 sec with this..
>
> Cheers,
> Geert
>
> From: Johan Mörén <[email protected]>
> Reply-To: MarkLogic Developer Discussion <[email protected]>
> Date: Sunday, June 28, 2015 at 12:07 AM
> To: MarkLogic Developer Discussion <[email protected]>
> Subject: Re: [MarkLogic Dev General] Find the document(s) with max
> occurrences of an element-attribute reference
>
> Thanks again for looking into this Geert!
>
> I tried a mix of your approach (minus the -$uris part) and mine and got
> better results. But that will not give me the ability to sort the whole
> database based on occurrence. Just got me the document(s) with the maximum
> number of occurrences. I tried this query in production where we have 1.4
> million documents and the total number of file-elements is roughly 25
> million. Got the result back in about 3 minutes. So it was definitely an
> improvement.
> But it will not scale over time. Thanks for looking down the
> UDF path. Hopefully this could lead to a more general and useful approach.
>
> Cheers,
> Johan
>
> On Sat, Jun 27, 2015 at 8:06 PM Geert Josten <[email protected]>
> wrote:
>
>> My approach was similar, but tried to sum all frequencies per uri.
>> Unfortunately, that approach gets slower with more documents, and more
>> distinct file sizes. Adding a simple count attribute or element in the file
>> somewhere would greatly simplify the run-time calculation, and that is what
>> I would normally recommend. For the sake of completeness I’ll give it some
>> more thought to see if there are ways to improve on the 3 minutes. A UDF
>> might be useful, would have to try that..
>>
>> Cheers,
>> Geert
>>
>> From: Johan Mörén <[email protected]>
>> Reply-To: MarkLogic Developer Discussion <[email protected]>
>> Date: Saturday, June 27, 2015 at 1:23 AM
>> To: MarkLogic Developer Discussion <[email protected]>
>> Subject: Re: [MarkLogic Dev General] Find the document(s) with max
>> occurrences of an element-attribute reference
>>
>> Hi Christopher
>>
>> I tried your approach but still without success. I think the case might
>> be that your example is using a fixed value for size ("yes"). And since
>> frequency is based on the value you get the right results.
>>
>> Regards,
>> Johan
>>
>> On Sat, Jun 27, 2015 at 12:34 AM Christopher Hamlin <[email protected]>
>> wrote:
>>
>>> Hi Johan,
>>>
>>> Maybe I'm not clear on what you want.
>>>
>>> I just tried something. I created documents in a database using
>>>
>>> xquery version "1.0-ml";
>>> for $i in 1 to 100
>>> let $doc := <doc>{(1 to $i)!<file size='yes'/>}</doc>
>>> let $uri := '/'||$i||'.xml'
>>> return xdmp:document-insert ($uri, $doc)
>>>
>>> so for example
>>>
>>> /1.xml =>
>>>
>>> <doc>
>>>   <file size="yes"/>
>>> </doc>
>>>
>>> and
>>>
>>> /2.xml =>
>>>
>>> <doc>
>>>   <file size="yes"/>
>>>   <file size="yes"/>
>>> </doc>
>>>
>>> and so on.
>>>
>>> With a file/@size element-attribute range index, the query
>>>
>>> xquery version '1.0-ml';
>>> let $uris := cts:uri-reference()
>>> let $ea := cts:element-attribute-reference (xs:QName ('file'),
>>>     xs:QName ('size'),
>>>     'collation=http://marklogic.com/collation/codepoint')
>>> return
>>>   for $tuple in cts:value-tuples(($uris, $ea),
>>>       ('item-frequency','frequency-order','descending','limit=3'))
>>>   return fn:concat ($tuple[1], ' -> ', cts:frequency ($tuple))
>>>
>>> returns
>>>
>>> /100.xml -> 100
>>> /99.xml -> 99
>>> /98.xml -> 98
>>> /97.xml -> 97
>>> /96.xml -> 96
>>> /95.xml -> 95
>>> /94.xml -> 94
>>> /93.xml -> 93
>>> /92.xml -> 92
>>> /91.xml -> 91
>>>
>>> Is this close to what you want?
>>>
>>> Regards,
>>>
>>> Chris
>>>
>>> On Fri, Jun 26, 2015 at 12:41 PM, Johan Mörén <[email protected]>
>>> wrote:
>>> > Hi Christopher!
>>> >
>>> > I'm not sure where you want me to use these options. I tried adding
>>> > them to the cts:value-tuples() call, but that did not return the
>>> > expected result.
>>> >
>>> > Like this:
>>> >
>>> > ...
>>> > for $tuple in
>>> >   cts:value-tuples(
>>> >     (
>>> >       cts:uri-reference(),
>>> >       $sizeRef
>>> >     ),
>>> >     ("frequency-order","descending","limit=10")
>>> >   )
>>> > ...
>>> >
>>> > Regards,
>>> > Johan
>>> >
>>> > On Fri, Jun 26, 2015 at 5:58 PM Christopher Hamlin <[email protected]>
>>> > wrote:
>>> >>
>>> >> If you just want something like top ten, I think it's more direct
>>> >> possibly.
>>> >>
>>> >> Can you try returning frequency-order, descending, limit=10? Are those
>>> >> options you can use?
_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
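For anyone reading this thread later: the `-$uris` step in Geert's query is the least obvious part, so here is a minimal sketch of what it appears to do. It assumes the UDF returns a map from document URI to a count (which is what the rest of his query implies), and uses a hand-built map with made-up URIs and counts as a hypothetical stand-in for that result. In MarkLogic, unary minus on a map inverts it, so counts become keys and URIs become values.

```xquery
xquery version "1.0-ml";

(: Hypothetical stand-in for the UDF result: document URI -> number of
   file elements. The URIs and counts below are invented for illustration. :)
let $uris := map:map()
let $_ := (
  map:put($uris, "/a.xml", 3),
  map:put($uris, "/b.xml", 7),
  map:put($uris, "/c.xml", 3)
)
(: Unary minus inverts the map: each count becomes a key (map keys are
   strings), and its value is the sequence of URIs that had that count. :)
let $counts := -$uris
for $key in map:keys($counts)
(: Keys are strings, so cast to int to sort numerically, not lexically. :)
order by xs:int($key) descending
return $key || " -> " || fn:string-join(map:get($counts, $key) ! fn:string(.), ", ")
```

Because several documents can share the same count, one key may map to more than one URI, which is why the query in the thread iterates over `map:get($counts, $key)` before applying `[1 to 10]`.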
