Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
Thanks, I’ll take a look.

Cheers,

E.
--
Eliot Kimber
http://contrext.com

From: Gary Vidal
Reply-To: MarkLogic Developer Discussion
Date: Thursday, May 25, 2017 at 5:37 AM
Subject: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Eliot,

I will share some code I wrote using Apache Flink, which does exactly what you want to do for MarkLogic on a client machine. The problem is that with such an old version of ML you are forced to pull every document out and perform the analysis externally. In my previous life I wrote a version that runs on MarkLogic using spawn and parallel tasks; I’m not sure it would work on 4.2, but I’ll share it for the sake of others. Feel free to contact me directly for any additional help:

https://github.com/garyvidal/ml-libraries/tree/master/task-spawner

___
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
[MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
Eliot,

I will share some code I wrote using Apache Flink, which does exactly what you want to do for MarkLogic on a client machine. The problem is that with such an old version of ML you are forced to pull every document out and perform the analysis externally. In my previous life I wrote a version that runs on MarkLogic using spawn and parallel tasks; I’m not sure it would work on 4.2, but I’ll share it for the sake of others. Feel free to contact me directly for any additional help:

https://github.com/garyvidal/ml-libraries/tree/master/task-spawner
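[Editorial note for readers without Flink: the external-analysis approach described above boils down to pulling each document out, parsing it, and counting the elements of interest on the client. A minimal sketch in Python, purely for illustration — the `docs` dict stands in for documents fetched from MarkLogic (e.g. via XCC or REST); this is not Gary’s Flink code.]

```python
import xml.etree.ElementTree as ET

def count_elements(doc_xml, tag):
    # Parse the document and count every occurrence of `tag` at any
    # depth (equivalent in spirit to count($article//p) on the server).
    root = ET.fromstring(doc_xml)
    return sum(1 for _ in root.iter(tag))

def analyze(docs, tag="p"):
    # Accumulate per-document counts entirely on the client.
    return {uri: count_elements(xml, tag) for uri, xml in docs.items()}

# Stand-in for documents pulled out of the database:
docs = {
    "/Default/a1.xml": "<Article id='a1'><p/><p/><sect><p/></sect></Article>",
    "/Default/a2.xml": "<Article id='a2'><p/></Article>",
}
print(analyze(docs))  # {'/Default/a1.xml': 3, '/Default/a2.xml': 1}
```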
Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
I got what I needed by creating a simple Groovy script that uses the XCC library to submit queries. The script is below. My main discovery was that I need to create a new session for every iteration to avoid connection timeouts. With this I was able to process several hundred thousand docs and accumulate the results on my local machine.

My command line is:

groovy -cp lib/xcc.jar GetArticleMetadataDetails.groovy

I chose Groovy because it supports Java libraries directly and makes it easy to script things.

Groovy script:

#!/usr/bin/env groovy
/*
 * Use the XCC jar to run enrichment jobs and collect the results.
 */
import com.marklogic.xcc.*;
import com.marklogic.xcc.types.*;

ContentSource source = ContentSourceFactory.newContentSource("myserver", 1984, "user", "pw");

RequestOptions options = new RequestOptions();
options.setRequestTimeLimit(3600)

moduleUrl = "rq-metadata-analysis.xqy"
println "Running module ${moduleUrl}..."
println new Date()

File outfile = new File("query-result.xml")
outfile.write "\n";

(36..56).each { index ->
    // A fresh session per iteration avoids connection timeouts.
    Session session = source.newSession();
    ModuleInvoke request = session.newModuleInvoke(moduleUrl)
    println "Group number: ${index}, ${new Date()}"
    request.setNewIntegerVariable("", "groupNum", index);
    request.setNewIntegerVariable("", "length", 1);
    request.setOptions(options);
    ResultSequence rs = session.submitRequest(request);
    ResultItem item = rs.next();
    XdmItem xdmItem = item.getItem();
    InputStream is = item.asInputStream();
    is.eachLine { line ->
        outfile.append line
        outfile.append "\n"
    }
    session.close();
}
outfile.append "";
println "Done."
// End of script.

--
Eliot Kimber
http://contrext.com

On 5/22/17, 10:43 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

I haven’t yet seen anything in the docs that directly addresses what I’m trying to do and suspect I’m simply missing some ML basics or just going about things the wrong way.
I have a corpus of several hundred thousand docs (but could be millions, of course), where each doc is an average of 200K and several thousand elements.

I want to analyze the corpus to get details about the number of specific subelements within each document, e.g.:

for $article in cts:search(/Article, cts:directory-query("/Default/", "infinity"))[$start to $end]
return <article id="{$article/@id}" paras="{count($article//p)}"/>

I’m running this as a query from Oxygen (so I can capture the results locally so I can do other stuff with them). On the server I’m using I blow the expanded tree cache if I try to request more than about 20,000 docs.

Is there a way to do this kind of processing over an arbitrarily large set *and* get the results back from a single query request?

I think the only solution is to write the results back to the database and then fetch that as the last thing, but I was hoping there was something simpler.

Have I missed an obvious solution?

Thanks,

Eliot
--
Eliot Kimber
http://contrext.com
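[Editorial note: the batching pattern in the script above — a groupNum and length variable per invocation, turned into a [$start to $end] window server-side — comes down to simple slice arithmetic. A sketch in Python for illustration; the 1-based window semantics are an assumption, since the actual rq-metadata-analysis.xqy module is not shown in the thread.]

```python
def window(group_num, window_size):
    # Map a 1-based group number to an inclusive [start, end] slice,
    # mirroring the [$start to $end] positional predicate on the server.
    # NOTE: the real semantics of the script's groupNum/length variables
    # are not shown in the thread; this windowing scheme is an assumption.
    start = (group_num - 1) * window_size + 1
    end = group_num * window_size
    return start, end

# Groups 36..56 at, say, 10,000 docs per window would cover
# positions 350,001 through 560,000 of the search results:
print(window(36, 10000))  # (350001, 360000)
print(window(56, 10000))  # (550001, 560000)
```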
Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
TDE is Template Driven Extraction. Short version: you define templates, and matching data goes straight into the indexes without you having to modify your document structure. Tutorial: http://developer.marklogic.com/learn/template-driven-extraction

--
Dave Cassel, @dmcassel <https://twitter.com/dmcassel>
Technical Community Manager
MarkLogic Corporation <http://www.marklogic.com/>
http://developer.marklogic.com/

On 5/23/17, 7:30 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

>What is TDE? I’m not conversant with ML 9 features yet.
>
>Also, I’m currently working against an ML 4.2 server (don’t ask).
>
>TaskBot looks like just what I need, but docs say it requires ML 7+ but
>could possibly be made to work with earlier releases. If someone can
>point me in the right direction I can take a stab at making it work with
>ML 4.
>
>Thanks,
>
>Eliot
>--
>Eliot Kimber
>http://contrext.com
>
>On 5/23/17, 8:56 AM, "general-boun...@developer.marklogic.com on behalf
>of Erik Hennum" erik.hen...@marklogic.com> wrote:
>
>Hi, Eliot:
>
>On reflection, let me retract the range index suggestion. I wasn't
>considering the domain implied by the element names -- it would never
>make sense to blow out a range index with the value of all of the
>paragraphs.
>
>The TDE suggestion for MarkLogic 9 would still work, however, because
>you could have an xs:short column with a value of 1 for every paragraph.
>
>Erik Hennum
Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
What is TDE? I’m not conversant with ML 9 features yet.

Also, I’m currently working against an ML 4.2 server (don’t ask).

TaskBot looks like just what I need, but docs say it requires ML 7+ but could possibly be made to work with earlier releases. If someone can point me in the right direction I can take a stab at making it work with ML 4.

Thanks,

Eliot
--
Eliot Kimber
http://contrext.com

On 5/23/17, 8:56 AM, "general-boun...@developer.marklogic.com on behalf of Erik Hennum" wrote:

Hi, Eliot:

On reflection, let me retract the range index suggestion. I wasn't considering the domain implied by the element names -- it would never make sense to blow out a range index with the value of all of the paragraphs.

The TDE suggestion for MarkLogic 9 would still work, however, because you could have an xs:short column with a value of 1 for every paragraph.

Erik Hennum

From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Erik Hennum [erik.hen...@marklogic.com]
Sent: Tuesday, May 23, 2017 6:21 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Hi, Eliot:

One alternative to Geert's good suggestion -- if and only if the number of element names is small and you can create range indexes on them:

* add an element attribute range index on Article/@id
* add an element range index on p
* execute a cts:value-tuples() call with the constraining element query and directory query
* iterate over the tuples, incrementing the value of the id in a map
* remove the range index on p

In MarkLogic 9, that approach gets simpler. You can just use TDE to project rows with columns for the id and element, group on the id column, and count the rows in the group.
Hoping that's useful (and salutations in passing),

Erik Hennum

From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Geert Josten [geert.jos...@marklogic.com]
Sent: Tuesday, May 23, 2017 12:53 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Hi Eliot,

I’d consider using taskbot (http://registry.demo.marklogic.com/package/taskbot), and using that in combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It will make optimal use of the TaskServer of the host on which you initiate the call. It doesn’t scale endlessly, but it batches up the work automatically for you, and will get you a lot further fairly easily.

Cheers,
Geert

On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

>I haven’t yet seen anything in the docs that directly addresses what I’m
>trying to do and suspect I’m simply missing some ML basics or just going
>about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but could be millions,
>of course), where each doc is an average of 200K and several thousand
>elements.
>
>I want to analyze the corpus to get details about the number of specific
>subelements within each document, e.g.:
>
>for $article in cts:search(/Article, cts:directory-query("/Default/",
>"infinity"))[$start to $end]
> return <article id="{$article/@id}" paras="{count($article//p)}"/>
>
>I’m running this as a query from Oxygen (so I can capture the results
>locally so I can do other stuff with them).
>
>On the server I’m using I blow the expanded tree cache if I try to
>request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large
>set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database
>and then fetch that as the last thing, but I was hoping there was
>something simpler.
> >Have I missed an obvious solution? > >Thanks, > >Eliot > >-- >Eliot Kimber >http://contrext.com > > > > >___ >General mailing list >General@developer.marklogic.com >Manage your subscription at: >http://developer.marklogic.com/mailman/listinfo/general ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://deve
Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
Hi, Eliot:

On reflection, let me retract the range index suggestion. I wasn't considering the domain implied by the element names -- it would never make sense to blow out a range index with the value of all of the paragraphs.

The TDE suggestion for MarkLogic 9 would still work, however, because you could have an xs:short column with a value of 1 for every paragraph.

Erik Hennum

From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Erik Hennum [erik.hen...@marklogic.com]
Sent: Tuesday, May 23, 2017 6:21 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Hi, Eliot:

One alternative to Geert's good suggestion -- if and only if the number of element names is small and you can create range indexes on them:

* add an element attribute range index on Article/@id
* add an element range index on p
* execute a cts:value-tuples() call with the constraining element query and directory query
* iterate over the tuples, incrementing the value of the id in a map
* remove the range index on p

In MarkLogic 9, that approach gets simpler. You can just use TDE to project rows with columns for the id and element, group on the id column, and count the rows in the group.

Hoping that's useful (and salutations in passing),

Erik Hennum

From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Geert Josten [geert.jos...@marklogic.com]
Sent: Tuesday, May 23, 2017 12:53 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Hi Eliot,

I’d consider using taskbot (http://registry.demo.marklogic.com/package/taskbot), and using that in combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It will make optimal use of the TaskServer of the host on which you initiate the call.
It doesn’t scale endlessly, but it batches up the work automatically for you, and will get you a lot further fairly easily.

Cheers,
Geert

On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

>I haven’t yet seen anything in the docs that directly addresses what I’m
>trying to do and suspect I’m simply missing some ML basics or just going
>about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but could be millions,
>of course), where each doc is an average of 200K and several thousand
>elements.
>
>I want to analyze the corpus to get details about the number of specific
>subelements within each document, e.g.:
>
>for $article in cts:search(/Article, cts:directory-query("/Default/",
>"infinity"))[$start to $end]
> return <article id="{$article/@id}" paras="{count($article//p)}"/>
>
>I’m running this as a query from Oxygen (so I can capture the results
>locally so I can do other stuff with them).
>
>On the server I’m using I blow the expanded tree cache if I try to
>request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large
>set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database
>and then fetch that as the last thing, but I was hoping there was
>something simpler.
>
>Have I missed an obvious solution?
>
>Thanks,
>
>Eliot
>
>--
>Eliot Kimber
>http://contrext.com
Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
Hi, Eliot:

One alternative to Geert's good suggestion -- if and only if the number of element names is small and you can create range indexes on them:

* add an element attribute range index on Article/@id
* add an element range index on p
* execute a cts:value-tuples() call with the constraining element query and directory query
* iterate over the tuples, incrementing the value of the id in a map
* remove the range index on p

In MarkLogic 9, that approach gets simpler. You can just use TDE to project rows with columns for the id and element, group on the id column, and count the rows in the group.

Hoping that's useful (and salutations in passing),

Erik Hennum

From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Geert Josten [geert.jos...@marklogic.com]
Sent: Tuesday, May 23, 2017 12:53 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Hi Eliot,

I’d consider using taskbot (http://registry.demo.marklogic.com/package/taskbot), and using that in combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It will make optimal use of the TaskServer of the host on which you initiate the call. It doesn’t scale endlessly, but it batches up the work automatically for you, and will get you a lot further fairly easily.

Cheers,
Geert

On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

>I haven’t yet seen anything in the docs that directly addresses what I’m
>trying to do and suspect I’m simply missing some ML basics or just going
>about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but could be millions,
>of course), where each doc is an average of 200K and several thousand
>elements.
>
>I want to analyze the corpus to get details about the number of specific
>subelements within each document, e.g.:
>
>for $article in cts:search(/Article, cts:directory-query("/Default/",
>"infinity"))[$start to $end]
> return <article id="{$article/@id}" paras="{count($article//p)}"/>
>
>I’m running this as a query from Oxygen (so I can capture the results
>locally so I can do other stuff with them).
>
>On the server I’m using I blow the expanded tree cache if I try to
>request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large
>set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database
>and then fetch that as the last thing, but I was hoping there was
>something simpler.
>
>Have I missed an obvious solution?
>
>Thanks,
>
>Eliot
>
>--
>Eliot Kimber
>http://contrext.com
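[Editorial note: Erik's "iterate over the tuples, incrementing the value of the id in a map" step is plain frequency counting. A sketch in Python for illustration — the tuple list here is a made-up stand-in for iterating cts:value-tuples() output, not actual MarkLogic API usage.]

```python
from collections import defaultdict

def count_by_id(tuples):
    # Increment a per-id counter for each (article-id, value) pair --
    # the "incrementing the value of the id in a map" step from the
    # recipe above. The input is a hypothetical stand-in for the
    # tuples returned by cts:value-tuples().
    counts = defaultdict(int)
    for article_id, _value in tuples:
        counts[article_id] += 1
    return dict(counts)

tuples = [("a1", "first para"), ("a1", "second para"), ("a2", "first para")]
print(count_by_id(tuples))  # {'a1': 2, 'a2': 1}
```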
Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
Hi Eliot,

I’d consider using taskbot (http://registry.demo.marklogic.com/package/taskbot), and using that in combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It will make optimal use of the TaskServer of the host on which you initiate the call. It doesn’t scale endlessly, but it batches up the work automatically for you, and will get you a lot further fairly easily.

Cheers,
Geert

On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

>I haven’t yet seen anything in the docs that directly addresses what I’m
>trying to do and suspect I’m simply missing some ML basics or just going
>about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but could be millions,
>of course), where each doc is an average of 200K and several thousand
>elements.
>
>I want to analyze the corpus to get details about the number of specific
>subelements within each document, e.g.:
>
>for $article in cts:search(/Article, cts:directory-query("/Default/",
>"infinity"))[$start to $end]
> return <article id="{$article/@id}" paras="{count($article//p)}"/>
>
>I’m running this as a query from Oxygen (so I can capture the results
>locally so I can do other stuff with them).
>
>On the server I’m using I blow the expanded tree cache if I try to
>request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large
>set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database
>and then fetch that as the last thing, but I was hoping there was
>something simpler.
>
>Have I missed an obvious solution?
>
>Thanks,
>
>Eliot
>
>--
>Eliot Kimber
>http://contrext.com
[MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
I haven’t yet seen anything in the docs that directly addresses what I’m trying to do and suspect I’m simply missing some ML basics or just going about things the wrong way.

I have a corpus of several hundred thousand docs (but could be millions, of course), where each doc is an average of 200K and several thousand elements.

I want to analyze the corpus to get details about the number of specific subelements within each document, e.g.:

for $article in cts:search(/Article, cts:directory-query("/Default/", "infinity"))[$start to $end]
return <article id="{$article/@id}" paras="{count($article//p)}"/>

I’m running this as a query from Oxygen (so I can capture the results locally so I can do other stuff with them). On the server I’m using I blow the expanded tree cache if I try to request more than about 20,000 docs.

Is there a way to do this kind of processing over an arbitrarily large set *and* get the results back from a single query request?

I think the only solution is to write the results back to the database and then fetch that as the last thing, but I was hoping there was something simpler.

Have I missed an obvious solution?

Thanks,

Eliot
--
Eliot Kimber
http://contrext.com