Hi Geert!

Compiled it and tested it locally (on a Mac) and it runs fine. I don't have
that many documents locally, roughly 700. It returns in 0.08
seconds if I flip the returned map. If I don't flip it, the operation takes
0.004 seconds. So I think the (-) operator is relatively expensive.
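
For reference, a minimal sketch of what the flip does, on a small hand-built
map instead of the UDF result. The URIs and counts here are made-up sample
data, not values from our database. As I understand it, the unary minus has
to visit every entry and build a new map keyed by the old values, so its cost
grows with the number of entries, while returning $map unchanged does no
extra work.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample data: URI -> number of file elements. :)
let $counts := map:map()
let $_ := (
  map:put($counts, "/a.xml", 3),
  map:put($counts, "/b.xml", 3),
  map:put($counts, "/c.xml", 1)
)
(: Unary minus inverts the map: each value becomes a key, and the
   original keys that shared that value are grouped under it. :)
let $by-count := -$counts
for $count in map:keys($by-count)
(: Map keys are strings, so cast before sorting numerically. :)
order by xs:int($count) descending
return $count || ": " || fn:string-join(map:get($by-count, $count), ", ")
```

This only illustrates the inversion itself; it says nothing about how the
cost compares to the aggregate call on a large database.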

Right now I can't compile and install this in our
test environment (Red Hat). Our devops guy is on holiday.

Thanks for the help! I learned a lot on the way :)

These are the queries I ran:

(: with flip: PT0.080913S :)
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
declare namespace mets = "http://www.loc.gov/METS/";

let $map := cts:aggregate(
  "grtjn/doc-count",
  "doc-count",
  (
    cts:uri-reference(),
    cts:element-attribute-reference(xs:QName("mets:file"), xs:QName("SIZE"))
  )
)

return -$map;

(: without flip: PT0.004229S :)

xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
declare namespace mets = "http://www.loc.gov/METS/";

let $map := cts:aggregate(
  "grtjn/doc-count",
  "doc-count",
  (
    cts:uri-reference(),
    cts:element-attribute-reference(xs:QName("mets:file"), xs:QName("SIZE"))
  )
)

return $map;


Regards!
Johan



On Mon, Jun 29, 2015 at 3:38 PM Johan Mörén <[email protected]> wrote:

> Thanks Geert!
>
> I will try this tomorrow!
>
> Regards,
> Johan
>
> On Mon, Jun 29, 2015 at 3:25 PM Geert Josten <[email protected]>
> wrote:
>
>>  I ran some profiles as well, and I think your profile results could have
>> been a bit misleading. I suspect that walking over the URIs and summing
>> frequencies is one of the major slow parts. I did manage to create a UDF
>> though (my first one). It is relatively generic, and should allow summing
>> frequencies of any elem/attrib. You can download it from here:
>>
>>  https://github.com/grtjn/doc-count-udf
>>
>>  Get/clone it, run `make` (pref on the target env), and follow
>> instructions to install it. After that you can run:
>>
>>  let $uris := cts:aggregate(
>>   "gjosten/doc-count",
>>   "doc-count",
>>   (
>>     cts:uri-reference(),
>>     cts:element-attribute-reference(xs:QName("file"), xs:QName("size"))
>>   )
>> )
>> let $counts := -$uris
>> let $top-keys :=
>>   for $key in map:keys($counts)
>>   order by xs:int($key) descending
>>   return $key
>> return (
>>   for $key in $top-keys
>>   for $value in map:get($counts, $key)
>>   return $value || " - " || $key
>> )[1 to 10]
>>
>>  I tested with 1k docs: my earlier tuples approach took 14 sec with
>> that, and this one takes less than 1 sec.
>>
>>  Cheers,
>> Geert
>>
>>   From: Johan Mörén <[email protected]>
>>
>> Reply-To: MarkLogic Developer Discussion <[email protected]>
>> Date: Sunday, June 28, 2015 at 12:07 AM
>>
>> To: MarkLogic Developer Discussion <[email protected]>
>> Subject: Re: [MarkLogic Dev General] Find the document(s) with max
>> occurrences of an element-attribute reference
>>
>>   Thanks again for looking into this Geert!
>>
>>  I tried a mix of your approach (minus the -$uris part) and mine, and
>> got better results. But that does not let me sort the whole database
>> by occurrence; it only gives me the document(s) with the maximum
>> number of occurrences. I tried this query in production, where we have 1.4
>> million documents and roughly 25 million file elements in total. I got the
>> result back in about 3 minutes, so it was definitely an improvement, but it
>> will not scale over time. Thanks for looking down the UDF path. Hopefully
>> this could lead to a more general and useful approach.
>>
>>  Cheers,
>>  Johan
>>
>>  On Sat, Jun 27, 2015 at 8:06 PM Geert Josten <[email protected]>
>> wrote:
>>
>>>  My approach was similar, but tried to sum all frequencies per uri.
>>> Unfortunately, that approach gets slower with more documents, and more
>>> distinct file sizes. Adding a simple count attribute or element in the file
>>> somewhere would greatly simplify the run-time calculation, and that is what
>>> I would normally recommend. For the sake of completeness I’ll give it some
>>> more thought to see if there are ways to improve on the 3 minutes. A UDF
>>> might be useful, would have to try that..
>>>
>>>  Cheers,
>>> Geert
>>>
>>>     From: Johan Mörén <[email protected]>
>>> Reply-To: MarkLogic Developer Discussion <[email protected]>
>>>    Date: Saturday, June 27, 2015 at 1:23 AM
>>> To: MarkLogic Developer Discussion <[email protected]>
>>> Subject: Re: [MarkLogic Dev General] Find the document(s) with max
>>> occurrences of an element-attribute reference
>>>
>>>   Hi Christopher
>>>
>>>  I tried your approach but still without success. I think the reason
>>> might be that your example uses a fixed value for size ("yes"). Since
>>> frequency is based on the value, you get the right results.
>>>
>>>  Regards,
>>> Johan
>>>
>>>
>>>
>>>  On Sat, Jun 27, 2015 at 12:34 AM Christopher Hamlin <[email protected]>
>>> wrote:
>>>
>>>> Hi Johan,
>>>>
>>>> Maybe I'm not clear on what you want.
>>>>
>>>> I just tried something.  I created documents in a database using
>>>>
>>>> xquery version "1.0-ml";
>>>> for $i in 1 to 100
>>>> let $doc := <doc>{(1 to $i)!<file size='yes'/>}</doc>
>>>> let $uri := '/'||$i||'.xml'
>>>> return xdmp:document-insert ($uri, $doc)
>>>>
>>>> so for example
>>>>
>>>> /1.xml =>
>>>>
>>>> <doc>
>>>> <file size="yes"/>
>>>> </doc>
>>>>
>>>> and
>>>>
>>>> /2.xml =>
>>>>
>>>> <doc>
>>>> <file size="yes"/>
>>>> <file size="yes"/>
>>>> </doc>
>>>>
>>>> and so on.
>>>>
>>>> With a file/@size element-attribute range index, the query
>>>>
>>>> xquery version '1.0-ml';
>>>> let $uris := cts:uri-reference()
>>>> let $ea := cts:element-attribute-reference (xs:QName ('file'),
>>>> xs:QName ('size'),
>>>> 'collation=http://marklogic.com/collation/codepoint')
>>>> return
>>>>     for $tuple in cts:value-tuples(($uris, $ea),
>>>> ('item-frequency','frequency-order','descending','limit=3'))
>>>>     return fn:concat ($tuple[1], ' -> ', cts:frequency ($tuple))
>>>>
>>>> returns
>>>>
>>>> /100.xml -> 100
>>>> /99.xml -> 99
>>>> /98.xml -> 98
>>>> /97.xml -> 97
>>>> /96.xml -> 96
>>>> /95.xml -> 95
>>>> /94.xml -> 94
>>>> /93.xml -> 93
>>>> /92.xml -> 92
>>>> /91.xml -> 91
>>>>
>>>> Is this close to what you want?
>>>>
>>>> Regards,
>>>>
>>>> Chris
>>>>
>>>> On Fri, Jun 26, 2015 at 12:41 PM, Johan Mörén <[email protected]>
>>>> wrote:
>>>> > Hi Christopher!
>>>> >
>>>> > I'm not sure where you want me to use these options, but I tried
>>>> > adding them to cts:value-tuples() and that did not return the
>>>> > expected result.
>>>> >
>>>> > like this
>>>> >
>>>> > ...
>>>> > for $tuple in
>>>> >     cts:value-tuples(
>>>> >       (
>>>> >         cts:uri-reference(),
>>>> >         $sizeRef
>>>> >       ),
>>>> >       ("frequency-order","descending","limit=10")
>>>> >
>>>> >     )
>>>> > ...
>>>> >
>>>> > Regards,
>>>> > Johan
>>>> >
>>>> > On Fri, Jun 26, 2015 at 5:58 PM Christopher Hamlin <
>>>> [email protected]>
>>>> > wrote:
>>>> >>
>>>> >> If you just want something like top ten, I think it's more direct
>>>> >> possibly.
>>>> >>
>>>> >> Can you try returning frequency-order, descending, limit=10? Are
>>>> >> those options you can use?
>>>> >>
>>>> >> _______________________________________________
>>>> >> General mailing list
>>>> >> [email protected]
>>>> >> Manage your subscription at:
>>>> >> http://developer.marklogic.com/mailman/listinfo/general