What is TDE? I’m not conversant with ML 9 features yet.

Also, I’m currently working against an ML 4.2 server (don’t ask).

TaskBot looks like just what I need, but the docs say it requires ML 7+, though 
it could possibly be made to work with earlier releases. If someone can point 
me in the right direction, I can take a stab at making it work with ML 4.

Thanks,

Eliot
--
Eliot Kimber
http://contrext.com
 
On 5/23/17, 8:56 AM, "general-boun...@developer.marklogic.com on behalf of Erik 
Hennum" <general-boun...@developer.marklogic.com on behalf of 
erik.hen...@marklogic.com> wrote:

    Hi, Eliot:
    
    On reflection, let me retract the range index suggestion.  I wasn't
    considering the domain implied by the element names -- it would never
    make sense to blow out a range index with the values of all of the
    paragraphs.
    
    The TDE suggestion for MarkLogic 9 would still work, however, because you
    could have an xs:short column with a value of 1 for every paragraph.
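
    For illustration, here is a minimal template sketch -- untested, the
    schema/view names are placeholders, and the upward step in the id
    column assumes it falls within TDE's supported XPath subset:

        xquery version "1.0-ml";
        import module namespace tde = "http://marklogic.com/xdmp/tde"
            at "/MarkLogic/tde.xqy";

        (: Validate and insert the template into the schemas database;
           it projects one row per paragraph, carrying the Article's id. :)
        tde:template-insert(
          "/templates/article-paras.xml",
          <template xmlns="http://marklogic.com/xdmp/tde">
            <context>/Article//p</context>
            <rows>
              <row>
                <schema-name>stats</schema-name>
                <view-name>paras</view-name>
                <columns>
                  <column>
                    <name>id</name>
                    <scalar-type>string</scalar-type>
                    <!-- assumption: this upward step is allowed by
                         TDE's restricted XPath subset -->
                    <val>ancestor::Article/@id</val>
                  </column>
                  <column>
                    <name>one</name>
                    <scalar-type>short</scalar-type>
                    <val>1</val>
                  </column>
                </columns>
              </row>
            </rows>
          </template>)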
    
    
    Erik Hennum
    
    ________________________________________
    From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Erik Hennum [erik.hen...@marklogic.com]
    Sent: Tuesday, May 23, 2017 6:21 AM
    To: MarkLogic Developer Discussion
    Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
    
    Hi, Eliot:
    
    One alternative to Geert's good suggestion -- if and only if the number
    of element names is small and you can create range indexes on them:
    
    *  add an element attribute range index on Article/@id
    *  add an element range index on p
    *  execute a cts:value-tuples() call with the constraining element query
       and directory query
    *  iterate over the tuples, incrementing the count for the id in a map
       (see the sketch after this list)
    *  remove the range index on p
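
    In code, that might look roughly like this (a sketch only -- and note
    the caveat that identical paragraph values within an article collapse
    into a single tuple):

        (: Assumes the two range indexes above exist and are string-typed. :)
        let $counts := map:map()
        let $tuples := cts:value-tuples(
          (cts:element-attribute-reference(xs:QName("Article"), xs:QName("id")),
           cts:element-reference(xs:QName("p"))),
          (),
          cts:and-query((
            cts:element-query(xs:QName("Article"), cts:true-query()),
            cts:directory-query("/Default/", "infinity"))))
        return (
          (: each tuple is a json:array of (id, paragraph value) :)
          for $tuple in $tuples
          let $id := xs:string(json:array-values($tuple)[1])
          return map:put($counts, $id, (map:get($counts, $id), 0)[1] + 1),
          $counts)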
    
    In MarkLogic 9, that approach gets simpler.  You can just use TDE
    to project rows with columns for the id and element, group on
    the id column, and count the rows in the group.
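
    For example (a sketch, assuming a TDE template that projects one row
    per paragraph into a stats.paras view with id and one columns -- the
    names are placeholders):

        import module namespace op = "http://marklogic.com/optic"
            at "/MarkLogic/optic.xqy";

        (: Group the per-paragraph rows by article id and count them. :)
        op:from-view("stats", "paras")
          => op:group-by(op:col("id"), op:count("paras", "one"))
          => op:result()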
    
    Hoping that's useful (and salutations in passing),
    
    
    Erik Hennum
    
    ________________________________________
    From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Geert Josten [geert.jos...@marklogic.com]
    Sent: Tuesday, May 23, 2017 12:53 AM
    To: MarkLogic Developer Discussion
    Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
    
    Hi Eliot,
    
    I'd consider using taskbot
    (http://registry.demo.marklogic.com/package/taskbot), and using that in
    combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It
    will make optimal use of the TaskServer of the host on which you initiate
    the call. It doesn't scale endlessly, but it batches up the work
    automatically for you, and will get you a lot further fairly easily.
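
    Roughly along these lines -- a sketch from memory of the taskbot
    README, so treat the module path and the tb:list-segment-process
    signature as assumptions and check the package docs:

        (: assumed namespace and install path -- verify against the README :)
        import module namespace tb = "com.blakeley.task-bot"
            at "/com.blakeley.task-bot/src/taskbot.xqy";

        tb:list-segment-process(
          cts:uris((), (), cts:directory-query("/Default/", "infinity")),
          500,                     (: batch size :)
          "article-para-counts",   (: label used in logging :)
          (: callback applied to each batch of URIs :)
          function($uris, $opts) {
            for $article in fn:doc($uris)/Article
            return <article-counts id="{$article/@id}"
                      paras="{fn:count($article//p)}"/>
          },
          (),
          $tb:OPTIONS-SYNC)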
    
    Cheers,
    Geert
    
    On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on behalf of
    Eliot Kimber" <general-boun...@developer.marklogic.com on behalf of
    ekim...@contrext.com> wrote:
    
    >I haven't yet seen anything in the docs that directly addresses what I'm
    >trying to do and suspect I'm simply missing some ML basics or just going
    >about things the wrong way.
    >
    >I have a corpus of several hundred thousand docs (but it could be millions,
    >of course), where each doc averages about 200K in size and contains several
    >thousand elements.
    >
    >I want to analyze the corpus to get details about the number of specific
    >subelements within each document, e.g.:
    >
    >
    >for $article in cts:search(/Article, cts:directory-query("/Default/",
    >"infinity"))[$start to $end]
    >return <article-counts id="{$article/@id}"
    >                       paras="{count($article//p)}"/>
    >
    >I'm running this as a query from Oxygen (so I can capture the results
    >locally and do other stuff with them).
    >
    >On the server I'm using, I blow the expanded tree cache if I try to
    >request more than about 20,000 docs.
    >
    >Is there a way to do this kind of processing over an arbitrarily large
    >set *and* get the results back from a single query request?
    >
    >I think the only solution is to write the results back to the database
    >and then fetch them as the last thing, but I was hoping there was
    >something simpler.
    >
    >Have I missed an obvious solution?
    >
    >Thanks,
    >
    >Eliot
    >
    >--
    >Eliot Kimber
    >http://contrext.com
    >
    >
    >
    >
_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general
