Hi, Eliot:

On reflection, let me retract the range index suggestion. I wasn't considering the domain implied by the element names -- it would never make sense to blow out a range index with the value of all of the paragraphs.

The TDE suggestion for MarkLogic 9 would still work, however, because you could have an xs:short column with a value of 1 for every paragraph.

Erik Hennum

________________________________________
From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Erik Hennum [erik.hen...@marklogic.com]
Sent: Tuesday, May 23, 2017 6:21 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Hi, Eliot:

One alternative to Geert's good suggestion -- if and only if the number of element names is small and you can create range indexes on them:

* add an element attribute range index on Article/@id
* add an element range index on p
* execute a cts:value-tuples() call with the constraining element query and directory query
* iterate over the tuples, incrementing the value of the id in a map
* remove the range index on p

In MarkLogic 9, that approach gets simpler. You can just use TDE to project rows with columns for the id and element, group on the id column, and count the rows in the group.

Hoping that's useful (and salutations in passing),

Erik Hennum

________________________________________
From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Geert Josten [geert.jos...@marklogic.com]
Sent: Tuesday, May 23, 2017 12:53 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Hi Eliot,

I'd consider using taskbot (http://registry.demo.marklogic.com/package/taskbot), and using that in combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It will make optimal use of the TaskServer of the host on which you initiate the call. It doesn't scale endlessly, but it batches up the work automatically for you, and will get you a lot further fairly easily.

Cheers,
Geert

On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" <general-boun...@developer.marklogic.com on behalf of ekim...@contrext.com> wrote:

>I haven't yet seen anything in the docs that directly addresses what I'm
>trying to do and suspect I'm simply missing some ML basics or just going
>about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but could be millions,
>of course), where each doc is an average of 200K and several thousand
>elements.
>
>I want to analyze the corpus to get details about the number of specific
>subelements within each document, e.g.:
>
>for $article in cts:search(/Article, cts:directory-query("/Default/",
>"infinity"))[$start to $end]
>  return <article-counts id="{$article/@id}"
>paras="{count($article//p)}"/>
>
>I'm running this as a query from Oxygen (so I can capture the results
>locally so I can do other stuff with them).
>
>On the server I'm using I blow the expanded tree cache if I try to
>request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large
>set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database
>and then fetch that as the last thing but I was hoping there was
>something simpler.
>
>Have I missed an obvious solution?
>
>Thanks,
>
>Eliot
>
>--
>Eliot Kimber
>http://contrext.com
>
>_______________________________________________
>General mailing list
>General@developer.marklogic.com
>Manage your subscription at:
>http://developer.marklogic.com/mailman/listinfo/general
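
The MarkLogic 9 approach mentioned in the thread (project one row per paragraph with TDE, group on the id column, count the rows in each group) might look roughly like the sketch below. It assumes a TDE template, not shown here, has already been installed; the schema name "stats", the view name "paragraphs", and the column names "id" and "para" are hypothetical and do not come from the thread.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes a TDE template (not shown) that emits one row per
   p element into the hypothetical view stats.paragraphs, with columns
   "id" (the Article/@id value) and "para" (an xs:short that is always 1). :)
import module namespace op = "http://marklogic.com/optic"
    at "/MarkLogic/optic.xqy";

op:from-view("stats", "paragraphs")
  (: one output row per article id, counting that article's paragraph rows :)
  => op:group-by("id", op:count("paras", "para"))
  => op:result()
```

Each result row would carry an id and a paras count. Since the counting runs over the projected rows rather than over expanded documents, it would be expected to sidestep the expanded tree cache limit described in the original question, though nothing in the thread measures that.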