Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Dave Cassel Tue, 23 May 2017 13:47:43 -0700

TDE is Template Driven Extraction.

Short version: you define templates, and matching data goes straight into
the indexes without you having to modify your document structure.
Tutorial: http://developer.marklogic.com/learn/template-driven-extraction
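
To make that concrete for the paragraph-count question in this thread, here
is a minimal sketch of a template, not a tested one. It follows Erik's idea
of projecting one row per p element with a constant column of 1. The schema
name (stats), view name (paragraphs), and column names are made up for
illustration. It also assumes the val expression may use the ancestor axis
to reach Article/@id, and it uses int for the constant column as a
conservative choice where Erik mentions xs:short.

```xml
<template xmlns="http://marklogic.com/xdmp/tde">
  <!-- One row is projected for every p element inside an Article. -->
  <context>/Article//p</context>
  <rows>
    <row>
      <!-- Hypothetical schema and view names, chosen for this example. -->
      <schema-name>stats</schema-name>
      <view-name>paragraphs</view-name>
      <columns>
        <column>
          <!-- Assumption: the ancestor axis is allowed in val here. -->
          <name>article_id</name>
          <scalar-type>string</scalar-type>
          <val>ancestor::Article/@id</val>
        </column>
        <column>
          <!-- Constant 1 per paragraph, so rows can be counted or summed. -->
          <name>one</name>
          <scalar-type>int</scalar-type>
          <val>1</val>
        </column>
      </columns>
    </row>
  </rows>
</template>
```

With a template along these lines installed, grouping the stats.paragraphs
rows on article_id and counting them should give the per-article paragraph
count from the indexes, rather than by loading each document.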


-- 
Dave Cassel, @dmcassel <https://twitter.com/dmcassel>
Technical Community Manager
MarkLogic Corporation <http://www.marklogic.com/>

http://developer.marklogic.com/




On 5/23/17, 7:30 AM, "[email protected] on behalf of
Eliot Kimber" <[email protected] on behalf of
[email protected]> wrote:

>
>What is TDE? I’m not conversant with ML 9 features yet.
>
>Also, I’m currently working against an ML 4.2 server (don’t ask).
>
>TaskBot looks like just what I need, but the docs say it requires ML 7+,
>though it could possibly be made to work with earlier releases. If someone
>can point me in the right direction I can take a stab at making it work
>with ML 4.
>
>Thanks,
>
>Eliot
>--
>Eliot Kimber
>http://contrext.com
> 
>
>
>
>On 5/23/17, 8:56 AM, "[email protected] on behalf
>of Erik Hennum" <[email protected] on behalf of
>[email protected]> wrote:
>
>    Hi, Eliot:
>    
>    On reflection, let me retract the range index suggestion.  I wasn't
>    considering the domain implied by the element names -- it would never
>    make sense to blow out a range index with the value of all of the
>    paragraphs.
>    
>    The TDE suggestion for MarkLogic 9 would still work, however, because
>    you could have an xs:short column with a value of 1 for every paragraph.
>    
>    
>    Erik Hennum
>    
>    ________________________________________
>    From: [email protected] [[email protected]] on behalf of
>    Erik Hennum [[email protected]]
>    Sent: Tuesday, May 23, 2017 6:21 AM
>    To: MarkLogic Developer Discussion
>    Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs
>    to Get Statistics
>    
>    Hi, Eliot:
>    
>    One alternative to Geert's good suggestion -- if and only if the number
>    of element names is small and you can create range indexes on them:
>    
>    *  add an element attribute range index on Article/@id
>    *  add an element range index on p
>    *  execute a cts:value-tuples() call with the constraining element
>       query and directory query
>    *  iterate over the tuples, incrementing the value of the id in a map
>    *  remove the range index on p
>    
>    In MarkLogic 9, that approach gets simpler.  You can just use TDE
>    to project rows with columns for the id and element, group on
>    the id column, and count the rows in the group.
>    
>    Hoping that's useful (and salutations in passing),
>    
>    
>    Erik Hennum
>    
>    ________________________________________
>    From: [email protected] [[email protected]] on behalf of
>    Geert Josten [[email protected]]
>    Sent: Tuesday, May 23, 2017 12:53 AM
>    To: MarkLogic Developer Discussion
>    Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs
>    to Get Statistics
>    
>    Hi Eliot,
>    
>    I'd consider using taskbot
>    (http://registry.demo.marklogic.com/package/taskbot), and using that in
>    combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It
>    will make optimal use of the TaskServer of the host on which you
>    initiate the call. It doesn't scale endlessly, but it batches up the
>    work automatically for you, and will get you a lot further fairly
>    easily.
>    
>    Cheers,
>    Geert
>    
>    On 5/23/17, 5:43 AM, "[email protected] on behalf of
>    Eliot Kimber" <[email protected] on behalf of
>    [email protected]> wrote:
>    
>    >I haven't yet seen anything in the docs that directly addresses what
>    >I'm trying to do and suspect I'm simply missing some ML basics or just
>    >going about things the wrong way.
>    >
>    >I have a corpus of several hundred thousand docs (but could be
>    >millions, of course), where each doc is an average of 200K and several
>    >thousand elements.
>    >
>    >I want to analyze the corpus to get details about the number of
>    >specific subelements within each document, e.g.:
>    >
>    >
>    >for $article in cts:search(/Article,
>    >    cts:directory-query("/Default/", "infinity"))[$start to $end]
>    >return <article-counts id="{$article/@id}"
>    >    paras="{count($article//p)}"/>
>    >
>    >I'm running this as a query from Oxygen (so I can capture the results
>    >locally so I can do other stuff with them).
>    >
>    >On the server I'm using I blow the expanded tree cache if I try to
>    >request more than about 20,000 docs.
>    >
>    >Is there a way to do this kind of processing over an arbitrarily large
>    >set *and* get the results back from a single query request?
>    >
>    >I think the only solution is to write the results back to the database
>    >and then fetch that as the last thing but I was hoping there was
>    >something simpler.
>    >
>    >Have I missed an obvious solution?
>    >
>    >Thanks,
>    >
>    >Eliot
>    >
>    >--
>    >Eliot Kimber
>    >http://contrext.com
>    >
>    >
>    >
>    >
>    >_______________________________________________
>    >General mailing list
>    >[email protected]
>    >Manage your subscription at:
>    >http://developer.marklogic.com/mailman/listinfo/general
>    

