Hi Geert!

I compiled it and tested it locally (on a Mac) and it runs fine. I don't have that many documents locally, roughly 700. It returns in 0.08 seconds if I flip the returned map. If I don't flip it, the operation takes 0.004 seconds. So I think the (-) operator is relatively expensive.
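To make the flip concrete, here is a minimal sketch of what the unary minus does to a map. The three URIs and counts are made up for illustration and stand in for the real UDF result; only `map:map`, `map:put`, `map:keys` and `map:get` are used.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample data: a small URI -> count map, built by hand
   in place of the cts:aggregate result. :)
let $map := map:map()
let $_ := (
  map:put($map, "/a.xml", 3),
  map:put($map, "/b.xml", 5),
  map:put($map, "/c.xml", 3)
)
(: Unary minus inverts the map: each count becomes a key, and the URIs
   that had that count become the values stored under it. :)
let $flipped := -$map
for $count in map:keys($flipped)
order by xs:int($count) descending
return $count || " -> " || fn:string-join(map:get($flipped, $count), ", ")
```

This should give one line for count 5 and one for count 3, the latter listing both URIs that share it. Since the inversion has to visit every entry and regroup the URIs by count, its cost would be expected to grow with the number of documents, which is consistent with the timings above.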
Right now I can't compile and install this in our test environment (Red Hat). Our devops guy is on holiday.

Thanks for the help! I learned a lot on the way :)

These are the queries I ran:

(: with flip: PT0.080913S :)
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
declare namespace mets = "http://www.loc.gov/METS/";

let $map := cts:aggregate(
  "grtjn/doc-count",
  "doc-count",
  (
    cts:uri-reference(),
    cts:element-attribute-reference(xs:QName("mets:file"), xs:QName("SIZE"))
  )
)
return -$map;

(: without flip: PT0.004229S :)
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
declare namespace mets = "http://www.loc.gov/METS/";

let $map := cts:aggregate(
  "grtjn/doc-count",
  "doc-count",
  (
    cts:uri-reference(),
    cts:element-attribute-reference(xs:QName("mets:file"), xs:QName("SIZE"))
  )
)
return $map;

Regards!
Johan

On Mon, Jun 29, 2015 at 3:38 PM Johan Mörén <[email protected]> wrote:

> Thanks Geert!
>
> I will try this tomorrow!
>
> Regards,
> Johan
>
> On Mon, Jun 29, 2015 at 3:25 PM Geert Josten <[email protected]>
> wrote:
>
>> I ran some profiles as well, I think your profile results could have
>> been a bit misleading. I suspect walking over uris, and summing frequencies
>> is one of the major slow parts. I managed to create a UDF though (my first
>> one). It is relatively generic, and should allow summing frequencies of any
>> elem/attrib. You can download it from here:
>>
>> https://github.com/grtjn/doc-count-udf
>>
>> Get/clone it, run `make` (pref on the target env), and follow
>> instructions to install it.
>> After that you can run:
>>
>> let $uris := cts:aggregate(
>>   "gjosten/doc-count",
>>   "doc-count",
>>   (
>>     cts:uri-reference(),
>>     cts:element-attribute-reference(xs:QName("file"), xs:QName("size"))
>>   )
>> )
>> let $counts := -$uris
>> let $top-keys :=
>>   for $key in map:keys($counts)
>>   order by xs:int($key) descending
>>   return $key
>> return (
>>   for $key in $top-keys
>>   for $value in map:get($counts, $key)
>>   return $value || " - " || $key
>> )[1 to 10]
>>
>> I tested with 1k docs, and my earlier tuples approach took 14 sec with
>> that, less than 1 sec with this..
>>
>> Cheers,
>> Geert
>>
>> From: Johan Mörén <[email protected]>
>> Reply-To: MarkLogic Developer Discussion <[email protected]>
>> Date: Sunday, June 28, 2015 at 12:07 AM
>> To: MarkLogic Developer Discussion <[email protected]>
>> Subject: Re: [MarkLogic Dev General] Find the document(s) with max
>> occurrences of an element-attribute reference
>>
>> Thanks again for looking into this Geert!
>>
>> I tried a mix of your approach (minus the -$uris part) and mine and
>> got better results. But that will not give me the ability to sort the whole
>> database based on occurrence. It just got me the document(s) with the maximum
>> number of occurrences. I tried this query in production where we have 1.4
>> million documents and the total number of file-elements is roughly 25
>> million. Got the result back in about 3 minutes. So it was definitely an
>> improvement. But it will not scale over time. Thanks for looking down the
>> UDF path. Hopefully this could lead to a more general and useful approach.
>>
>> Cheers,
>> Johan
>>
>> On Sat, Jun 27, 2015 at 8:06 PM Geert Josten <[email protected]>
>> wrote:
>>
>>> My approach was similar, but tried to sum all frequencies per uri.
>>> Unfortunately, that approach gets slower with more documents, and more
>>> distinct file sizes.
>>> Adding a simple count attribute or element in the file
>>> somewhere would greatly simplify the run-time calculation, and that is what
>>> I would normally recommend. For the sake of completeness I’ll give it some
>>> more thought to see if there are ways to improve on the 3 minutes. A UDF
>>> might be useful, would have to try that..
>>>
>>> Cheers,
>>> Geert
>>>
>>> From: Johan Mörén <[email protected]>
>>> Reply-To: MarkLogic Developer Discussion <[email protected]>
>>> Date: Saturday, June 27, 2015 at 1:23 AM
>>> To: MarkLogic Developer Discussion <[email protected]>
>>> Subject: Re: [MarkLogic Dev General] Find the document(s) with max
>>> occurrences of an element-attribute reference
>>>
>>> Hi Christopher
>>>
>>> I tried your approach but still without success. I think the case
>>> might be that your example is using a fixed value for size ("yes"). And
>>> since frequency is based on the value you get the right results.
>>>
>>> Regards,
>>> Johan
>>>
>>> On Sat, Jun 27, 2015 at 12:34 AM Christopher Hamlin <[email protected]>
>>> wrote:
>>>
>>>> Hi Johan,
>>>>
>>>> Maybe I'm not clear on what you want.
>>>>
>>>> I just tried something. I created documents in a database using
>>>>
>>>> xquery version "1.0-ml";
>>>> for $i in 1 to 100
>>>> let $doc := <doc>{(1 to $i)!<file size='yes'/>}</doc>
>>>> let $uri := '/'||$i||'.xml'
>>>> return xdmp:document-insert ($uri, $doc)
>>>>
>>>> so for example
>>>>
>>>> /1.xml =>
>>>>
>>>> <doc>
>>>>   <file size="yes"/>
>>>> </doc>
>>>>
>>>> and
>>>>
>>>> /2.xml =>
>>>>
>>>> <doc>
>>>>   <file size="yes"/>
>>>>   <file size="yes"/>
>>>> </doc>
>>>>
>>>> and so on.
>>>>
>>>> With a file/@size element-attribute range index, the query
>>>>
>>>> xquery version '1.0-ml';
>>>> let $uris := cts:uri-reference()
>>>> let $ea := cts:element-attribute-reference (xs:QName ('file'),
>>>>   xs:QName ('size'),
>>>>   'collation=http://marklogic.com/collation/codepoint')
>>>> return
>>>>   for $tuple in cts:value-tuples(($uris, $ea),
>>>>     ('item-frequency','frequency-order','descending','limit=3'))
>>>>   return fn:concat ($tuple[1], ' -> ', cts:frequency ($tuple))
>>>>
>>>> returns
>>>>
>>>> /100.xml -> 100
>>>> /99.xml -> 99
>>>> /98.xml -> 98
>>>> /97.xml -> 97
>>>> /96.xml -> 96
>>>> /95.xml -> 95
>>>> /94.xml -> 94
>>>> /93.xml -> 93
>>>> /92.xml -> 92
>>>> /91.xml -> 91
>>>>
>>>> Is this close to what you want?
>>>>
>>>> Regards,
>>>>
>>>> Chris
>>>>
>>>> On Fri, Jun 26, 2015 at 12:41 PM, Johan Mörén <[email protected]>
>>>> wrote:
>>>> > Hi Christopher!
>>>> >
>>>> > I'm not sure where you want me to use these options. I tried adding
>>>> > them to the cts:value-tuples() call, but that did not return the
>>>> > expected result.
>>>> >
>>>> > like this
>>>> >
>>>> > ...
>>>> > for $tuple in
>>>> >   cts:value-tuples(
>>>> >     (
>>>> >       cts:uri-reference(),
>>>> >       $sizeRef
>>>> >     ),
>>>> >     ("frequency-order","descending","limit=10")
>>>> >   )
>>>> > ...
>>>> >
>>>> > Regards,
>>>> > Johan
>>>> >
>>>> > On Fri, Jun 26, 2015 at 5:58 PM Christopher Hamlin <[email protected]>
>>>> > wrote:
>>>> >>
>>>> >> If you just want something like top ten, I think it's more direct
>>>> >> possibly.
>>>> >>
>>>> >> Can you try returning frequency-order, descending, limit=10? Are those
>>>> >> options you can use?
>>>> >>
>>>> >> _______________________________________________
>>>> >> General mailing list
>>>> >> [email protected]
>>>> >> Manage your subscription at:
>>>> >> http://developer.marklogic.com/mailman/listinfo/general
