Thanks Geert!

I will try this tomorrow!

Regards,
Johan

On Mon, Jun 29, 2015 at 3:25 PM Geert Josten <[email protected]>
wrote:

>  I ran some profiles as well, and I think your profile results may have
> been a bit misleading. I suspect that walking over the URIs and summing
> frequencies is one of the major slow parts. I did manage to create a UDF
> (my first one). It is relatively generic and should allow summing
> frequencies of any element/attribute. You can download it from here:
>
>  https://github.com/grtjn/doc-count-udf
>
>  Get/clone it, run `make` (preferably on the target environment), and
> follow the instructions to install it. After that you can run:
>
>  let $uris := cts:aggregate(
>   "gjosten/doc-count",
>   "doc-count",
>   (
>     cts:uri-reference(),
>     cts:element-attribute-reference(xs:QName("file"), xs:QName("size"))
>   )
> )
> let $counts := -$uris (: unary minus inverts the map: counts become the keys :)
> let $top-keys :=
>   for $key in map:keys($counts)
>   order by xs:int($key) descending
>   return $key
> return (
>   for $key in $top-keys
>   for $value in map:get($counts, $key)
>   return $value || " - " || $key
> )[1 to 10]
>
>  I tested with 1k docs: my earlier tuples approach took 14 sec with
> that, and this one takes less than 1 sec.
>
>  Cheers,
> Geert
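
A minimal sketch of the per-URI summing described above, written in Python rather than XQuery so it runs without a MarkLogic server. It is not the UDF's actual implementation (that lives in the linked repository); the tuple data and the helper names `sum_per_uri`, `invert`, and `top_n` are made up for illustration. It assumes each co-occurrence arrives as a (uri, size, frequency) triple, sums the frequencies per URI, inverts the result the way `-$uris` does in the query, and lists the highest counts first.

```python
from collections import defaultdict

# Hypothetical (uri, size, frequency) co-occurrences, standing in for what a
# uri/size tuple lookup with item frequencies would yield.
TUPLES = [
    ("/a.xml", "10", 2),
    ("/a.xml", "20", 1),
    ("/b.xml", "10", 1),
    ("/c.xml", "30", 3),
    ("/c.xml", "40", 2),
    ("/d.xml", "50", 3),
]


def sum_per_uri(tuples):
    """Add up the frequencies of all size values seen for each URI."""
    counts = defaultdict(int)
    for uri, _size, frequency in tuples:
        counts[uri] += frequency
    return dict(counts)


def invert(counts):
    """Turn {uri: count} into {count: [uris]}."""
    inverted = defaultdict(list)
    for uri, count in counts.items():
        inverted[count].append(uri)
    return dict(inverted)


def top_n(tuples, n):
    """Return up to n 'uri - count' strings, highest count first."""
    inverted = invert(sum_per_uri(tuples))
    rows = [
        f"{uri} - {count}"
        for count in sorted(inverted, reverse=True)
        for uri in sorted(inverted[count])
    ]
    return rows[:n]


if __name__ == "__main__":
    for row in top_n(TUPLES, 3):
        print(row)
```

The sort happens on the summed counts only, so the work grows with the number of distinct (uri, size) pairs, which may be why the pure XQuery version slows down as documents and distinct sizes increase.
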
>
>  From: Johan Mörén <[email protected]>
> Reply-To: MarkLogic Developer Discussion <[email protected]>
> Date: Sunday, June 28, 2015 at 12:07 AM
> To: MarkLogic Developer Discussion <[email protected]>
> Subject: Re: [MarkLogic Dev General] Find the document(s) with max
> occurrences of an element-attribute reference
>
>   Thanks again for looking into this Geert!
>
>  I tried a mix of your approach (minus the -$uris part) and mine and got
> better results. But that does not give me the ability to sort the whole
> database by occurrence; it only gets me the document(s) with the maximum
> number of occurrences. I tried this query in production, where we have 1.4
> million documents and roughly 25 million file elements in total. I got the
> result back in about 3 minutes, so it was definitely an improvement, but it
> will not scale over time. Thanks for looking down the UDF path. Hopefully
> this could lead to a more general and useful approach.
>
>  Cheers,
>  Johan
>
>  On Sat, Jun 27, 2015 at 8:06 PM Geert Josten <[email protected]>
> wrote:
>
>>  My approach was similar, but I tried to sum all frequencies per URI.
>> Unfortunately, that approach gets slower with more documents and more
>> distinct file sizes. Adding a simple count attribute or element somewhere
>> in the file would greatly simplify the run-time calculation, and that is
>> what I would normally recommend. For the sake of completeness I’ll give it
>> some more thought to see if there are ways to improve on the 3 minutes. A
>> UDF might be useful, but I would have to try that.
>>
>>  Cheers,
>> Geert
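
A rough sketch of the "store a count in the file" idea, in Python rather than XQuery so it can be run without a server. The attribute name `file-count` and the helpers `with_file_count` and `rank_by_count` are hypothetical, chosen only for this example. It counts the file elements once when a document is written, stores the number on the root element, and then ranks documents by that single stored value.

```python
import xml.etree.ElementTree as ET


def with_file_count(xml_text, attr="file-count"):
    """Return the document with the number of file elements stored on the root."""
    root = ET.fromstring(xml_text)
    root.set(attr, str(len(root.findall(".//file"))))
    return ET.tostring(root, encoding="unicode")


def rank_by_count(docs, attr="file-count"):
    """Sort URIs by the stored count, highest first; no per-value summing needed."""
    counts = {uri: int(ET.fromstring(xml).get(attr)) for uri, xml in docs.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "/a.xml": with_file_count('<doc><file size="10"/><file size="20"/></doc>'),
        "/b.xml": with_file_count('<doc><file size="10"/></doc>'),
    }
    print(rank_by_count(docs))
```

In MarkLogic the analogous step would presumably be to write the count at insert time and put a range index on it, so that each document contributes exactly one value to sort on; that part is an assumption and is not shown here.
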
>>
>>  From: Johan Mörén <[email protected]>
>> Reply-To: MarkLogic Developer Discussion <[email protected]>
>> Date: Saturday, June 27, 2015 at 1:23 AM
>> To: MarkLogic Developer Discussion <[email protected]>
>> Subject: Re: [MarkLogic Dev General] Find the document(s) with max
>> occurrences of an element-attribute reference
>>
>>   Hi Christopher
>>
>>  I tried your approach but still without success. I think the reason might
>> be that your example uses a fixed value for size ("yes"). Since frequency
>> is based on the value, you get the right results.
>>
>>  Regards,
>> Johan
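
To illustrate the point about fixed values, here is a small Python model (not XQuery, so it runs without a server). It assumes, as this message suggests, that the frequency of a (uri, size) tuple counts occurrences of that particular size value in the document. The documents and the helpers `tuple_frequencies` and `top_tuple` are invented for the example.

```python
from collections import Counter


def tuple_frequencies(docs):
    """Model item frequency per (uri, size) pair.

    docs maps a URI to the list of size values of its file elements.
    """
    freqs = Counter()
    for uri, sizes in docs.items():
        for size in sizes:
            freqs[(uri, size)] += 1
    return freqs


def top_tuple(docs):
    """Return the (uri, size) pair with the highest frequency, and that frequency."""
    return tuple_frequencies(docs).most_common(1)[0]


# Every file element has the same size value, as in the generated test documents.
fixed = {"/1.xml": ["yes"], "/3.xml": ["yes", "yes", "yes"]}

# Distinct size values: /big.xml has more file elements, but no single value
# repeats, so each of its tuples has frequency 1.
varied = {"/big.xml": ["10", "20", "30", "40"], "/small.xml": ["7", "7"]}

if __name__ == "__main__":
    print(top_tuple(fixed))
    print(top_tuple(varied))
```

With the fixed value, the top tuple belongs to the document with the most file elements. With varied values, `/small.xml` comes out on top even though `/big.xml` has twice as many file elements, which is why the frequencies have to be summed per URI.
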
>>
>>
>>
>>  On Sat, Jun 27, 2015 at 12:34 AM Christopher Hamlin <[email protected]>
>> wrote:
>>
>>> Hi Johan,
>>>
>>> Maybe I'm not clear on what you want.
>>>
>>> I just tried something.  I created documents in a database using
>>>
>>> xquery version "1.0-ml";
>>> for $i in 1 to 100
>>> let $doc := <doc>{(1 to $i)!<file size='yes'/>}</doc>
>>> let $uri := '/'||$i||'.xml'
>>> return xdmp:document-insert ($uri, $doc)
>>>
>>> so for example
>>>
>>> /1.xml =>
>>>
>>> <doc>
>>> <file size="yes"/>
>>> </doc>
>>>
>>> and
>>>
>>> /2.xml =>
>>>
>>> <doc>
>>> <file size="yes"/>
>>> <file size="yes"/>
>>> </doc>
>>>
>>> and so on.
>>>
>>> With a file/@size element-attribute range index, the query
>>>
>>> xquery version '1.0-ml';
>>> let $uris := cts:uri-reference()
>>> let $ea := cts:element-attribute-reference (xs:QName ('file'),
>>> xs:QName ('size'),
>>> 'collation=http://marklogic.com/collation/codepoint')
>>> return
>>>     for $tuple in cts:value-tuples(($uris, $ea),
>>> ('item-frequency','frequency-order','descending','limit=10'))
>>>     return fn:concat ($tuple[1], ' -> ', cts:frequency ($tuple))
>>>
>>> returns
>>>
>>> /100.xml -> 100
>>> /99.xml -> 99
>>> /98.xml -> 98
>>> /97.xml -> 97
>>> /96.xml -> 96
>>> /95.xml -> 95
>>> /94.xml -> 94
>>> /93.xml -> 93
>>> /92.xml -> 92
>>> /91.xml -> 91
>>>
>>> Is this close to what you want?
>>>
>>> Regards,
>>>
>>> Chris
>>>
>>> On Fri, Jun 26, 2015 at 12:41 PM, Johan Mörén <[email protected]>
>>> wrote:
>>> > Hi Christopher!
>>> >
>>> > I'm not sure where you want me to use these options, but I tried adding
>>> > them to cts:value-tuples() and that did not return the expected result.
>>> >
>>> > Like this:
>>> >
>>> > ...
>>> > for $tuple in
>>> >     cts:value-tuples(
>>> >       (
>>> >         cts:uri-reference(),
>>> >         $sizeRef
>>> >       ),
>>> >       ("frequency-order","descending","limit=10")
>>> >
>>> >     )
>>> > ...
>>> >
>>> > Regards,
>>> > Johan
>>> >
>>> > On Fri, Jun 26, 2015 at 5:58 PM Christopher Hamlin <[email protected]>
>>> > wrote:
>>> >>
>>> >> If you just want something like a top ten, I think there may be a more
>>> >> direct way.
>>> >>
>>> >> Can you try the options frequency-order, descending, and limit=10? Are
>>> >> those options you can use?
>>> >>
>>> >> _______________________________________________
>>> >> General mailing list
>>> >> [email protected]
>>> >> Manage your subscription at:
>>> >> http://developer.marklogic.com/mailman/listinfo/general