Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

2017-05-25 Thread Eliot Kimber
Thanks, I’ll take a look.

Cheers,

E.

--
Eliot Kimber
http://contrext.com

From:  on behalf of Gary Vidal 

Reply-To: MarkLogic Developer Discussion 
Date: Thursday, May 25, 2017 at 5:37 AM
To: 
Subject: [MarkLogic Dev General] Processing Large Number of Docs to Get 
Statistics

 

Eliot,

I will share some code I wrote using Apache Flink, which does exactly what you 
want to do for MarkLogic on a client machine. The problem is that with such an 
old version of ML you are forced to pull every document out and perform the 
analysis externally. In a previous life I wrote a version that runs on 
MarkLogic using spawn and parallel tasks; I’m not sure it would work on 4.2, 
but I’ll share it for the sake of others. Feel free to contact me directly for 
any additional help.

https://github.com/garyvidal/ml-libraries/tree/master/task-spawner




Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

2017-05-24 Thread Eliot Kimber
I got what I needed by creating a simple Groovy script that uses the XCC 
library to submit queries. The script is below. My main discovery was that I 
needed to create a new session for every iteration to avoid connection 
timeouts. With this I was able to process several hundred thousand docs and 
accumulate the results on my local machine. My command line is:

groovy -cp lib/xcc.jar GetArticleMetadataDetails.groovy

I chose Groovy because it supports Java libraries directly and makes it easy to 
script things.

Groovy script:

#!/usr/bin/env groovy
/*
 * Use the XCC jar to run the analysis module in batches and collect the
 * results locally.
 */

import com.marklogic.xcc.*

ContentSource source = ContentSourceFactory.newContentSource("myserver", 1984,
    "user", "pw")

RequestOptions options = new RequestOptions()
// Allow long-running requests on the server side (seconds).
options.setRequestTimeLimit(3600)

moduleUrl = "rq-metadata-analysis.xqy"

println "Running module ${moduleUrl}..."
println new Date()
File outfile = new File("query-result.xml")

// NOTE: the wrapper element was stripped by the list archive; "results"
// is a guess at the original root element name.
outfile.write "<results>\n"

(36..56).each { index ->
    // A new session per iteration avoids connection timeouts.
    Session session = source.newSession()
    ModuleInvoke request = session.newModuleInvoke(moduleUrl)

    println "Group number: ${index}, ${new Date()}"
    request.setNewIntegerVariable("", "groupNum", index)
    request.setNewIntegerVariable("", "length", 1)

    request.setOptions(options)

    ResultSequence rs = session.submitRequest(request)

    // Stream the single result item into the output file.
    ResultItem item = rs.next()
    InputStream is = item.asInputStream()

    is.eachLine { line ->
        outfile.append line
        outfile.append "\n"
    }
    session.close()
}

outfile.append "</results>"

println "Done."
//  End of script.
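
The server-side module is not shown in the thread. A minimal sketch of what 
rq-metadata-analysis.xqy might look like follows; only the two external 
variable names come from the script above, while the batch size, the batch 
arithmetic, the meaning of $length, and the result element are assumptions:

xquery version "1.0-ml";

(: Hypothetical reconstruction of rq-metadata-analysis.xqy. Only the
   external variable names are taken from the Groovy script; everything
   else, including treating $length as "number of groups covered per
   call", is assumed. :)
declare variable $groupNum as xs:integer external;
declare variable $length as xs:integer external;

declare variable $batch-size as xs:integer := 10000;

let $start := ($groupNum - 1) * $batch-size + 1
let $end := ($groupNum + $length - 1) * $batch-size
for $article in cts:search(/Article,
    cts:directory-query("/Default/", "infinity"))[$start to $end]
return <article paras="{count($article//p)}"/>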

--
Eliot Kimber
http://contrext.com
 



On 5/22/17, 10:43 PM, "general-boun...@developer.marklogic.com on behalf of 
Eliot Kimber"  wrote:

I haven’t yet seen anything in the docs that directly addresses what I’m 
trying to do and suspect I’m simply missing some ML basics or just going about 
things the wrong way.

I have a corpus of several hundred thousand docs (but could be millions, of 
course), where each doc is an average of 200K and several thousand elements.

I want to analyze the corpus to get details about the number of specific 
subelements within each document, e.g.:

for $article in cts:search(/Article, cts:directory-query("/Default/", 
"infinity"))[$start to $end]
 return <article paras="{count($article//p)}"/>

I’m running this as a query from Oxygen (so I can capture the results 
locally and do other stuff with them).

On the server I’m using I blow the expanded tree cache if I try to request 
more than about 20,000 docs.

Is there a way to do this kind of processing over an arbitrarily large set 
*and* get the results back from a single query request?

I think the only solution is to write the results back to the database 
and then fetch that as the last step, but I was hoping there was something 
simpler.

Have I missed an obvious solution?

Thanks,

Eliot

--
Eliot Kimber
http://contrext.com
 





Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

2017-05-23 Thread Dave Cassel
TDE is Template Driven Extraction.

Short version: you define templates, and matching data goes straight into the
indexes without you having to modify your document structure.
Tutorial: http://developer.marklogic.com/learn/template-driven-extraction
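
As a rough illustration for the paragraph-counting case in this thread, a 
template might project one row per paragraph, following Erik Hennum's 
suggestion downthread. This is a sketch, not a tested template: the schema, 
view, and column names are invented, and the nested-template/vars layout is 
an assumption:

<template xmlns="http://marklogic.com/xdmp/tde">
  <context>/Article</context>
  <vars>
    <!-- Capture the article id once; the per-paragraph rows reuse it. -->
    <var><name>ARTICLE_ID</name><val>./@id</val></var>
  </vars>
  <templates>
    <template>
      <!-- One row per paragraph; count rows per id at query time. -->
      <context>.//p</context>
      <rows>
        <row>
          <schema-name>stats</schema-name>
          <view-name>paras</view-name>
          <columns>
            <column>
              <name>id</name>
              <scalar-type>string</scalar-type>
              <val>$ARTICLE_ID</val>
            </column>
            <column>
              <name>one</name>
              <scalar-type>int</scalar-type>
              <val>1</val>
            </column>
          </columns>
        </row>
      </rows>
    </template>
  </templates>
</template>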

-- 
Dave Cassel, @dmcassel <https://twitter.com/dmcassel>
Technical Community Manager
MarkLogic Corporation <http://www.marklogic.com/>

http://developer.marklogic.com/




On 5/23/17, 7:30 AM, "general-boun...@developer.marklogic.com on behalf of
Eliot Kimber"  wrote:

>
>What is TDE? I’m not conversant with ML 9 features yet.
>
>Also, I’m currently working against an ML 4.2 server (don’t ask).
>
>TaskBot looks like just what I need, but the docs say it requires ML 7+,
>though it could possibly be made to work with earlier releases. If someone
>can point me in the right direction I can take a stab at making it work
>with ML 4.
>
>Thanks,
>
>Eliot
>--
>Eliot Kimber
>http://contrext.com
> 
>
>
>
>On 5/23/17, 8:56 AM, "general-boun...@developer.marklogic.com on behalf
>of Erik Hennum" <erik.hen...@marklogic.com> wrote:
>
>Hi, Eliot:
>
>On reflection, let me retract the range index suggestion.  I wasn't
>considering
>the domain implied by the element names -- it would never make sense
>to blow out a range index with the value of all of the paragraphs.
>
>The TDE suggestion for MarkLogic 9 would still work, however, because
>you
>could have an xs:short column with a value of 1 for every paragraph.
>
>
>Erik Hennum
>
>
>From: general-boun...@developer.marklogic.com
>[general-boun...@developer.marklogic.com] on behalf of Erik Hennum
>[erik.hen...@marklogic.com]
>Sent: Tuesday, May 23, 2017 6:21 AM
>To: MarkLogic Developer Discussion
>Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs
>to Get Statistics
>
>Hi, Eliot:
>
>One alternative to Geert's good suggestion -- if and only if the
>number
>of element names is small and you can create range indexes on them:
>
>*  add an element attribute range index on Article/@id
>*  add an element range index on p
>*  execute a cts:value-tuples() call with the constraining element
>query and directory query
>*  iterate over the tuples, incrementing the value of the id in a map
>*  remove the range index on p
>
>In MarkLogic 9, that approach gets simpler.  You can just use TDE
>to project rows with columns for the id and element, group on
>the id column, and count the rows in the group.
>
>Hoping that's useful (and salutations in passing),
>
>
>Erik Hennum
>
>________
>    From: general-boun...@developer.marklogic.com
>[general-boun...@developer.marklogic.com] on behalf of Geert Josten
>[geert.jos...@marklogic.com]
>Sent: Tuesday, May 23, 2017 12:53 AM
>To: MarkLogic Developer Discussion
>Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs
>to Get Statistics
>
>Hi Eliot,
>
>I’d consider using taskbot
>(http://registry.demo.marklogic.com/package/taskbot), and using that
>in combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE.
>It will make optimal use of the TaskServer of the host on which you
>initiate the call. It doesn’t scale endlessly, but it batches up the
>work automatically for you, and will get you a lot further fairly
>easily.
>
>Cheers,
>Geert
>
>On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on
>behalf of
>Eliot Kimber" <ekim...@contrext.com> wrote:
>
>>I haven’t yet seen anything in the docs that directly addresses what
>>I’m trying to do and suspect I’m simply missing some ML basics or
>>just going about things the wrong way.
>>
>>I have a corpus of several hundred thousand docs (but could be
>>millions, of course), where each doc is an average of 200K and several
>>thousand elements.
>>
>>I want to analyze the corpus to get details about the number of
>>specific subelements within each document, e.g.:
>>
>>for $article in cts:search(/Article, cts:directory-query("/Default/",
>>"infinity"))[$start to $end]
>> return <article paras="{count($article//p)}"/>
>>
>>I’m running this as a query from Oxygen (so I can capture the results
>>locally and do other stuff with them).

Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

2017-05-23 Thread Eliot Kimber

What is TDE? I’m not conversant with ML 9 features yet.

Also, I’m currently working against an ML 4.2 server (don’t ask).

TaskBot looks like just what I need, but the docs say it requires ML 7+, 
though it could possibly be made to work with earlier releases. If someone can 
point me in the right direction I can take a stab at making it work with ML 4.

Thanks,

Eliot
--
Eliot Kimber
http://contrext.com
 



On 5/23/17, 8:56 AM, "general-boun...@developer.marklogic.com on behalf of Erik 
Hennum"  wrote:

Hi, Eliot:

On reflection, let me retract the range index suggestion.  I wasn't 
considering
the domain implied by the element names -- it would never make sense
to blow out a range index with the value of all of the paragraphs.

The TDE suggestion for MarkLogic 9 would still work, however, because you
could have an xs:short column with a value of 1 for every paragraph.


Erik Hennum


From: general-boun...@developer.marklogic.com 
[general-boun...@developer.marklogic.com] on behalf of Erik Hennum 
[erik.hen...@marklogic.com]
Sent: Tuesday, May 23, 2017 6:21 AM
To: MarkLogic Developer Discussion
    Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get 
Statistics

Hi, Eliot:

One alternative to Geert's good suggestion -- if and only if the number
of element names is small and you can create range indexes on them:

*  add an element attribute range index on Article/@id
*  add an element range index on p
*  execute a cts:value-tuples() call with the constraining element query 
and directory query
*  iterate over the tuples, incrementing the value of the id in a map
*  remove the range index on p

In MarkLogic 9, that approach gets simpler.  You can just use TDE
to project rows with columns for the id and element, group on
the id column, and count the rows in the group.

Hoping that's useful (and salutations in passing),


Erik Hennum


From: general-boun...@developer.marklogic.com 
[general-boun...@developer.marklogic.com] on behalf of Geert Josten 
[geert.jos...@marklogic.com]
Sent: Tuesday, May 23, 2017 12:53 AM
To: MarkLogic Developer Discussion
    Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get 
Statistics

Hi Eliot,

I’d consider using taskbot
(http://registry.demo.marklogic.com/package/taskbot), and using that in
combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It
will make optimal use of the TaskServer of the host on which you initiate
the call. It doesn’t scale endlessly, but it batches up the work
automatically for you, and will get you a lot further fairly easily.

Cheers,
Geert

On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on behalf of
Eliot Kimber"  wrote:

>I haven’t yet seen anything in the docs that directly addresses what I’m
>trying to do and suspect I’m simply missing some ML basics or just going
>about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but could be millions,
>of course), where each doc is an average of 200K and several thousand
>elements.
>
>I want to analyze the corpus to get details about the number of specific
>subelements within each document, e.g.:
>
>for $article in cts:search(/Article, cts:directory-query("/Default/",
>"infinity"))[$start to $end]
> return <article paras="{count($article//p)}"/>
>
>I’m running this as a query from Oxygen (so I can capture the results
>locally and do other stuff with them).
>
>On the server I’m using I blow the expanded tree cache if I try to
>request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large
>set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database
>and then fetch that as the last thing, but I was hoping there was
>something simpler.
>
>Have I missed an obvious solution?
>
>Thanks,
>
>Eliot
>
>--
>Eliot Kimber
>http://contrext.com
>
>
>
>

Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

2017-05-23 Thread Erik Hennum
Hi, Eliot:

On reflection, let me retract the range index suggestion.  I wasn't considering
the domain implied by the element names -- it would never make sense
to blow out a range index with the value of all of the paragraphs.

The TDE suggestion for MarkLogic 9 would still work, however, because you
could have an xs:short column with a value of 1 for every paragraph.
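
Concretely, once a template projects one row per paragraph (as sketched 
earlier in the thread), the per-article counts reduce to a group-and-count at 
query time. A sketch only, assuming a stats.paras view with an id column as in 
that hypothetical template:

(: Sketch: the stats.paras view and its id column are assumptions
   carried over from the hypothetical template earlier in the thread. :)
xdmp:sql("
  SELECT id, COUNT(*) AS paras
  FROM stats.paras
  GROUP BY id
")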


Erik Hennum


From: general-boun...@developer.marklogic.com 
[general-boun...@developer.marklogic.com] on behalf of Erik Hennum 
[erik.hen...@marklogic.com]
Sent: Tuesday, May 23, 2017 6:21 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get 
Statistics

Hi, Eliot:

One alternative to Geert's good suggestion -- if and only if the number
of element names is small and you can create range indexes on them:

*  add an element attribute range index on Article/@id
*  add an element range index on p
*  execute a cts:value-tuples() call with the constraining element query and 
directory query
*  iterate over the tuples, incrementing the value of the id in a map
*  remove the range index on p

In MarkLogic 9, that approach gets simpler.  You can just use TDE
to project rows with columns for the id and element, group on
the id column, and count the rows in the group.

Hoping that's useful (and salutations in passing),


Erik Hennum


From: general-boun...@developer.marklogic.com 
[general-boun...@developer.marklogic.com] on behalf of Geert Josten 
[geert.jos...@marklogic.com]
Sent: Tuesday, May 23, 2017 12:53 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get 
Statistics

Hi Eliot,

I’d consider using taskbot
(http://registry.demo.marklogic.com/package/taskbot), and using that in
combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It
will make optimal use of the TaskServer of the host on which you initiate
the call. It doesn’t scale endlessly, but it batches up the work
automatically for you, and will get you a lot further fairly easily.

Cheers,
Geert

On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on behalf of
Eliot Kimber"  wrote:

>I haven’t yet seen anything in the docs that directly addresses what I’m
>trying to do and suspect I’m simply missing some ML basics or just going
>about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but could be millions,
>of course), where each doc is an average of 200K and several thousand
>elements.
>
>I want to analyze the corpus to get details about the number of specific
>subelements within each document, e.g.:
>
>for $article in cts:search(/Article, cts:directory-query("/Default/",
>"infinity"))[$start to $end]
> return <article paras="{count($article//p)}"/>
>
>I’m running this as a query from Oxygen (so I can capture the results
>locally and do other stuff with them).
>
>On the server I’m using I blow the expanded tree cache if I try to
>request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large
>set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database
>and then fetch that as the last thing, but I was hoping there was
>something simpler.
>
>Have I missed an obvious solution?
>
>Thanks,
>
>Eliot
>
>--
>Eliot Kimber
>http://contrext.com
>
>
>
>


Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

2017-05-23 Thread Erik Hennum
Hi, Eliot:

One alternative to Geert's good suggestion -- if and only if the number 
of element names is small and you can create range indexes on them:

*  add an element attribute range index on Article/@id
*  add an element range index on p
*  execute a cts:value-tuples() call with the constraining element query and 
directory query
*  iterate over the tuples, incrementing the value of the id in a map
*  remove the range index on p
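
A minimal sketch of the tuple iteration, assuming ML 8+ lexicon APIs and the 
element names from Eliot's example; everything else is an assumption, and note 
that Erik retracts this range-index approach in his follow-up above:

(: Sketch of the tuples-to-map accumulation; requires the two range
   indexes described above. Element names are from Eliot's example. :)
let $counts := map:map()
let $_ :=
  for $tuple in cts:value-tuples(
    (cts:element-attribute-reference(xs:QName("Article"), xs:QName("id")),
     cts:element-reference(xs:QName("p"))),
    (),
    cts:directory-query("/Default/", "infinity"))
  let $id := xs:string(json:array-values($tuple)[1])
  return map:put($counts, $id,
    (map:get($counts, $id), 0)[1] + cts:frequency($tuple))
return $counts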

In MarkLogic 9, that approach gets simpler.  You can just use TDE
to project rows with columns for the id and element, group on 
the id column, and count the rows in the group.

Hoping that's useful (and salutations in passing),


Erik Hennum


From: general-boun...@developer.marklogic.com 
[general-boun...@developer.marklogic.com] on behalf of Geert Josten 
[geert.jos...@marklogic.com]
Sent: Tuesday, May 23, 2017 12:53 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get 
Statistics

Hi Eliot,

I’d consider using taskbot
(http://registry.demo.marklogic.com/package/taskbot), and using that in
combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It
will make optimal use of the TaskServer of the host on which you initiate
the call. It doesn’t scale endlessly, but it batches up the work
automatically for you, and will get you a lot further fairly easily.

Cheers,
Geert

On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on behalf of
Eliot Kimber"  wrote:

>I haven’t yet seen anything in the docs that directly addresses what I’m
>trying to do and suspect I’m simply missing some ML basics or just going
>about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but could be millions,
>of course), where each doc is an average of 200K and several thousand
>elements.
>
>I want to analyze the corpus to get details about the number of specific
>subelements within each document, e.g.:
>
>for $article in cts:search(/Article, cts:directory-query("/Default/",
>"infinity"))[$start to $end]
> return <article paras="{count($article//p)}"/>
>
>I’m running this as a query from Oxygen (so I can capture the results
>locally and do other stuff with them).
>
>On the server I’m using I blow the expanded tree cache if I try to
>request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large
>set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database
>and then fetch that as the last thing, but I was hoping there was
>something simpler.
>
>Have I missed an obvious solution?
>
>Thanks,
>
>Eliot
>
>--
>Eliot Kimber
>http://contrext.com
>
>
>
>


Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

2017-05-23 Thread Geert Josten
Hi Eliot,

I’d consider using taskbot
(http://registry.demo.marklogic.com/package/taskbot), and using that in
combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It
will make optimal use of the TaskServer of the host on which you initiate
the call. It doesn’t scale endlessly, but it batches up the work
automatically for you, and will get you a lot further fairly easily.
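
For anyone on a server too old for taskbot, the same batching idea can be 
hand-rolled with xdmp:spawn. A minimal sketch, where the module URI, the 
batch size, and the convention that each task persists its partial result 
for later collection are all assumptions:

(: Sketch: queue one task per batch on the task server. The module
   /analyze-batch.xqy is hypothetical; it would process documents
   [$start to $end] for its group and save its partial result back
   into the database for later collection. :)
let $batch-size := 10000
let $total := xdmp:estimate(
  cts:search(/Article, cts:directory-query("/Default/", "infinity")))
for $group in 1 to xs:integer(fn:ceiling($total div $batch-size))
return xdmp:spawn("/analyze-batch.xqy",
  (xs:QName("groupNum"), $group,
   xs:QName("batchSize"), $batch-size))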

Cheers,
Geert

On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on behalf of
Eliot Kimber"  wrote:

>I haven’t yet seen anything in the docs that directly addresses what I’m
>trying to do and suspect I’m simply missing some ML basics or just going
>about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but could be millions,
>of course), where each doc is an average of 200K and several thousand
>elements.
>
>I want to analyze the corpus to get details about the number of specific
>subelements within each document, e.g.:
>
>for $article in cts:search(/Article, cts:directory-query("/Default/",
>"infinity"))[$start to $end]
> return <article paras="{count($article//p)}"/>
>
>I’m running this as a query from Oxygen (so I can capture the results
>locally and do other stuff with them).
>
>On the server I’m using I blow the expanded tree cache if I try to
>request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large
>set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database
>and then fetch that as the last thing, but I was hoping there was
>something simpler.
>
>Have I missed an obvious solution?
>
>Thanks,
>
>Eliot
>
>--
>Eliot Kimber
>http://contrext.com
> 
>
>
>

___
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general