Re: [MarkLogic Dev General] concurrent invocation of xquery ending up with duplicate writes

2017-05-23 Thread Geert Josten
Hi Raghu,

The best way to ensure that concurrent threads do not create a document at the
same URI is by using locks. Here is code and some explanation on how best to do
that: http://registry.demo.marklogic.com/package/ml-unique
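
A minimal sketch of the locking pattern (not taken from the ml-unique package;
the function name and "/locks/" prefix are illustrative):

(: Take an exclusive transaction lock on a well-known URI before the
   existence check, so concurrent transactions serialize on it. The lock
   is released automatically when the transaction commits or rolls back. :)
declare function local:insert-once($uri as xs:string, $doc as node())
{
  (: lock on a "mutex" URI derived from the logical key :)
  xdmp:lock-for-update("/locks/" || $uri),
  if (fn:doc-available($uri))
  then ()
  else xdmp:document-insert($uri, $doc)
};

Because the lock URI is derived from the target URI, only writers of the same
document block each other; unrelated inserts proceed in parallel.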

Cheers,
Geert

From: Raghu
Reply-To: MarkLogic Developer Discussion
Date: Tuesday, May 23, 2017 at 8:54 PM
To: General MarkLogic Developer Discussion
Subject: [MarkLogic Dev General] concurrent invocation of xquery ending up with 
duplicate writes

All,

I have a reader.xqy which performs only read operations and does not write to the
forest, except for one document insert. I don't want that reader query to obtain
locks on all referenced documents, so I moved the document insert logic to a
separate writer.xqy and invoke it from reader.xqy.

My current logic is

if random-xml already exists

DO NOTHING

else INSERT RANDOM-XML

The problem I am facing is,

when I invoke reader.xqy using multiple threads concurrently, I end up
with duplicate writer XMLs even though I have validations in place. How do I
make sure that even if reader.xqy is invoked concurrently by several
threads, only one of the invocations inserts the XML?


Note: I need the random-xml inserted before reader.xqy completes
execution, and the URI of the random-xml involves a dynamically generated ID,
NOT a constant URI.

Thanks in advance
Raghu

___
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general


Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

2017-05-23 Thread Dave Cassel
TDE is Template Driven Extraction.

Short version: you define templates, matching data goes straight into the
indexes without you having to modify your document structure.
Tutorial: http://developer.marklogic.com/learn/template-driven-extraction
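
A minimal template sketch (the schema, view, and column names here are made up;
see the tutorial for the authoritative element set):

<template xmlns="http://marklogic.com/xdmp/tde">
  <context>/Article</context>
  <rows>
    <row>
      <schema-name>stats</schema-name>
      <view-name>articles</view-name>
      <columns>
        <column>
          <name>id</name>
          <scalar-type>string</scalar-type>
          <val>@id</val>
        </column>
      </columns>
    </row>
  </rows>
</template>

Once inserted with tde:template-insert, matching documents are projected into
the row index with no change to the documents themselves.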

-- 
Dave Cassel, @dmcassel 
Technical Community Manager
MarkLogic Corporation 

http://developer.marklogic.com/




On 5/23/17, 7:30 AM, "general-boun...@developer.marklogic.com on behalf of
Eliot Kimber"  wrote:

>
>What is TDE? I’m not conversant with ML 9 features yet.
>
>Also, I’m currently working against an ML 4.2 server (don’t ask).
>
>TaskBot looks like just what I need but docs say it requires ML 7+ but
>could possibly be made to work with earlier releases. If someone can
>point me in the right direction I can take a stab at making it work with
>ML 4.
>
>Thanks,
>
>Eliot
>--
>Eliot Kimber
>http://contrext.com
> 

Re: [MarkLogic Dev General] Unsubscribe

2017-05-23 Thread Raymond Lawrence

[MarkLogic Dev General] concurrent invocation of xquery ending up with duplicate writes

2017-05-23 Thread Raghu
All,

I have a reader.xqy which performs only read operations and does not write to
the forest, except for one document insert. I don't want that reader query to
obtain locks on all referenced documents, so I moved the document insert
logic to a separate writer.xqy and invoke it from reader.xqy.

My current logic is

if random-xml already exists

DO NOTHING

else INSERT RANDOM-XML

The problem I am facing is,

when I invoke reader.xqy using multiple threads concurrently, I end
up with duplicate writer XMLs even though I have validations in
place. How do I make sure that even if reader.xqy is invoked
concurrently by several threads, only one of the invocations inserts
the XML?


Note: I need the random-xml inserted before reader.xqy completes
execution, and the URI of the random-xml involves a dynamically generated ID,
NOT a constant URI.
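
One way to get the insert committed before reader.xqy finishes is to run the
writer in its own transaction (a sketch; the module path and variable name are
illustrative):

(: Run writer.xqy in a separate transaction so its insert commits before
   reader.xqy continues; reader.xqy itself stays a read-only query and
   takes no write locks. :)
xdmp:invoke("/writer.xqy",
  (xs:QName("uri"), $target-uri),  (: external variable for the dynamic URI :)
  <options xmlns="xdmp:eval">
    <isolation>different-transaction</isolation>
  </options>)

Note this alone does not prevent duplicates under concurrency; the writer still
needs to serialize its check-then-insert, e.g. with a lock.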

Thanks in advance
Raghu
___
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general


[MarkLogic Dev General] Ignoring empty/null values while search

2017-05-23 Thread Shiv Shankar
Hi,
How can I skip empty/null values? My dob property has an empty ("") string
for some of the documents, and the query below gives the error XDMP-LEXVAL:
xs.date("") -- Invalid lexical value ""
Query:
cts.search(cts.andQuery([cts.jsonPropertyRangeQuery("dob", '>',
fn.currentDate().subtract(xs.yearMonthDuration("P25Y"))),cts.collectionQuery("col-1")]));
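
One possible workaround (an untested sketch): exclude documents whose dob is the
empty string before the range comparison is applied; alternatively, set the
range index's "invalid values" setting to "ignore" so empty strings never enter
the date index.

// Sketch: keep the empty-string docs out of the date comparison entirely.
cts.search(cts.andQuery([
  cts.notQuery(cts.jsonPropertyValueQuery("dob", "")),
  cts.jsonPropertyRangeQuery("dob", ">",
    fn.currentDate().subtract(xs.yearMonthDuration("P25Y"))),
  cts.collectionQuery("col-1")
]));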

And
Q2: Any sample for group-by age by extracting date from dob?

Thanks
Shan.
___
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general


Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

2017-05-23 Thread Eliot Kimber

What is TDE? I’m not conversant with ML 9 features yet.

Also, I’m currently working against an ML 4.2 server (don’t ask).

TaskBot looks like just what I need, but the docs say it requires ML 7+, though it
could possibly be made to work with earlier releases. If someone can point me in the
right direction I can take a stab at making it work with ML 4.

Thanks,

Eliot
--
Eliot Kimber
http://contrext.com
 



On 5/23/17, 8:56 AM, "general-boun...@developer.marklogic.com on behalf of Erik 
Hennum"  wrote:

Hi, Eliot:

On reflection, let me retract the range index suggestion.  I wasn't 
considering
the domain implied by the element names -- it would never make sense
to blow out a range index with the value of all of the paragraphs.

The TDE suggestion for MarkLogic 9 would still work, however, because you
could have an xs:short column with a value of 1 for every paragraph.


Erik Hennum


From: general-boun...@developer.marklogic.com 
[general-boun...@developer.marklogic.com] on behalf of Erik Hennum 
[erik.hen...@marklogic.com]
Sent: Tuesday, May 23, 2017 6:21 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get 
Statistics

Hi, Eliot:

One alternative to Geert's good suggestion -- if and only if the number
of element names is small and you can create range indexes on them:

*  add an element attribute range index on Article/@id
*  add an element range index on p
*  execute a cts:value-tuples() call with the constraining element query 
and directory query
*  iterate over the tuples, incrementing the value of the id in a map
*  remove the range index on p

In MarkLogic 9, that approach gets simpler.  You can just use TDE
to project rows with columns for the id and element, group on
the id column, and count the rows in the group.

Hoping that's useful (and salutations in passing),


Erik Hennum


___
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general

Re: [MarkLogic Dev General] Unsubscribe

2017-05-23 Thread John Snelson
On 23/05/17 14:57, Hanumantharayappa, Shanthamurthy wrote:
> To subscribe or unsubscribe via the World Wide Web, visit
> http://developer.marklogic.com/mailman/listinfo/general
> or, via email, send a message with subject or body 'help' to
> general-requ...@developer.marklogic.com
>
-- 
John Snelson, Principal Engineer  http://twitter.com/jpcs
MarkLogic Corporation http://www.marklogic.com
___
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general


[MarkLogic Dev General] Unsubscribe

2017-05-23 Thread Hanumantharayappa, Shanthamurthy

Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

2017-05-23 Thread Erik Hennum
Hi, Eliot:

On reflection, let me retract the range index suggestion.  I wasn't considering
the domain implied by the element names -- it would never make sense
to blow out a range index with the value of all of the paragraphs.

The TDE suggestion for MarkLogic 9 would still work, however, because you
could have an xs:short column with a value of 1 for every paragraph.


Erik Hennum


From: general-boun...@developer.marklogic.com 
[general-boun...@developer.marklogic.com] on behalf of Erik Hennum 
[erik.hen...@marklogic.com]
Sent: Tuesday, May 23, 2017 6:21 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get 
Statistics

Hi, Eliot:

One alternative to Geert's good suggestion -- if and only if the number
of element names is small and you can create range indexes on them:

*  add an element attribute range index on Article/@id
*  add an element range index on p
*  execute a cts:value-tuples() call with the constraining element query and 
directory query
*  iterate over the tuples, incrementing the value of the id in a map
*  remove the range index on p

In MarkLogic 9, that approach gets simpler.  You can just use TDE
to project rows with columns for the id and element, group on
the id column, and count the rows in the group.

Hoping that's useful (and salutations in passing),


Erik Hennum


From: general-boun...@developer.marklogic.com 
[general-boun...@developer.marklogic.com] on behalf of Geert Josten 
[geert.jos...@marklogic.com]
Sent: Tuesday, May 23, 2017 12:53 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get 
Statistics

Hi Eliot,

I'd consider using taskbot
(http://registry.demo.marklogic.com/package/taskbot), and using that in
combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It
will make optimal use of the TaskServer of the host on which you initiate
the call. It doesn't scale endlessly, but it batches up the work
automatically for you, and will get you a lot further fairly easily.

Cheers,
Geert

On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on behalf of
Eliot Kimber"  wrote:

>I haven't yet seen anything in the docs that directly addresses what I'm
>trying to do and suspect I'm simply missing some ML basics or just going
>about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but could be millions,
>of course), where each doc is an average of 200K and several thousand
>elements.
>
>I want to analyze the corpus to get details about the number of specific
>subelements within each document, e.g.:
>
>
>for $article in cts:search(/Article, cts:directory-query("/Default/",
>"infinity"))[$start to $end]
> return <article paras="{count($article//p)}"/>
>
>I'm running this as a query from Oxygen (so I can capture the results
>locally so I can do other stuff with them).
>
>On the server I'm using, I blow the expanded tree cache if I try to
>request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large
>set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database
>and then fetch that as the last thing, but I was hoping there was
>something simpler.
>
>Have I missed an obvious solution?
>
>Thanks,
>
>Eliot
>
>--
>Eliot Kimber
>http://contrext.com
>
>
>
>
>___
>General mailing list
>General@developer.marklogic.com
>Manage your subscription at:
>http://developer.marklogic.com/mailman/listinfo/general

___
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general


Re: [MarkLogic Dev General] General Digest, Vol 155, Issue 24

2017-05-23 Thread Shiv Shankar
>
>
>
> --
>
> Message: 3
> Date: Mon, 22 May 2017 22:43:26 -0500
> From: Eliot Kimber <ekim...@contrext.com>
> Subject: [MarkLogic Dev General] Processing Large Number of Docs to
> Get Statistics
> To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Message-ID: <bdf9d2b1-c160-455d-b836-bc11c1db7...@contrext.com>
> Content-Type: text/plain;   charset="UTF-8"
>
> I haven't yet seen anything in the docs that directly addresses what I'm
> trying to do and suspect I'm simply missing some ML basics or just going
> about things the wrong way.
>
> I have a corpus of several hundred thousand docs (but could be millions,
> of course), where each doc is an average of 200K and several thousand
> elements.
>
> I want to analyze the corpus to get details about the number of specific
> subelements within each document, e.g.:
>
>
> for $article in cts:search(/Article, cts:directory-query("/Default/",
> "infinity"))[$start to $end]
>  return <article paras="{count($article//p)}"/>
>
> I'm running this as a query from Oxygen (so I can capture the results
> locally so I can do other stuff with them).
>
> On the server I'm using, I blow the expanded tree cache if I try to request
> more than about 20,000 docs.
>
> Is there a way to do this kind of processing over an arbitrarily large set
> *and* get the results back from a single query request?
>
> I think the only solution is to write the results back to the database
> and then fetch that as the last thing, but I was hoping there was something
> simpler.
>
> Have I missed an obvious solution?
>
> Thanks,
>
> Eliot
>
> --
> Eliot Kimber
> http://contrext.com
>
>
>
>
>
>
> --
>
> Message: 4
> Date: Tue, 23 May 2017 07:24:31 +
> From: Geert Josten <geert.jos...@marklogic.com>
> Subject: Re: [MarkLogic Dev General] Priorities for queries
> To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Message-ID: <d549af3b.117ddb%geert.jos...@marklogic.com>
> Content-Type: text/plain; charset="windows-1252"
>
> Hi Oleksii,
>
> If you use xdmp:spawn or xdmp:spawn-function, you would be able to use the
> <priority> option. It takes "normal" and "higher" as values. These
> priorities have separate queues and worker threads, so they should
> interfere less with each other.
>
> It might also be worth looking into a way to push out low-priority work to
> a dedicated host for longer-running tasks. You could do that by writing
> such queries to the database, have a schedule running on that particular
> host monitor for such tasks, which picks them up 1 by 1, and writes back
> results once done. It might be easiest to switch the script queries over to
> an asynchronous process that polls regularly to see if results have been
> written. Makes sense?
>
> Cheers,
> Geert
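
A sketch of the spawn call described above (the spawned helper function is
hypothetical):

(: Queue script work at normal priority; interactive endpoints can spawn
   at "higher" so the two use separate task queues and worker threads. :)
xdmp:spawn-function(
  function() { local:heavy-script-work() },  (: hypothetical helper :)
  <options xmlns="xdmp:eval">
    <priority>normal</priority>
  </options>
)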
>
> From: <general-boun...@developer.marklogic.com<mailto:general-
> boun...@developer.marklogic.com>> on behalf of Oleksii Segeda <
> oseg...@worldbankgroup.org<mailto:oseg...@worldbankgroup.org>>
> Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com
> <mailto:general@developer.marklogic.com>>
> Date: Monday, May 22, 2017 at 8:59 PM
> To: "general@developer.marklogic.com<mailto:general@developer.
> marklogic.com>" <general@developer.marklogic.com<mailto:general@developer.
> marklogic.com>>
> Subject: [MarkLogic Dev General] Priorities for queries
>
> Hi,
>
> Is there a way to give a lower priority to certain queries? We have two
> different types of API consumers: real users and various scripts.
> No matter how often the scripts hit endpoints or how "heavy" their queries
> are, they should not affect API performance for real users.
> In other words, scripts are tolerant of high latency, but users are not.
>
> Regards,
>
> Oleksii Segeda
>
> IT Analyst
>
> Information and Technology Solutions
>
> W
>
> www.worldbank.org<http://www.worldbank.org/>
>
>
>
>
>
> --
>
> ___
> General mailing list
> General@developer.marklogic.com
> Manage your subscription at:
> http://developer.marklogic.com/mailman/listinfo/general
>
>
> End of General Digest, Vol 155, Issue 24
> 
>
___
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general


Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

2017-05-23 Thread Erik Hennum
Hi, Eliot:

One alternative to Geert's good suggestion -- if and only if the number 
of element names is small and you can create range indexes on them:

*  add an element attribute range index on Article/@id
*  add an element range index on p
*  execute a cts:value-tuples() call with the constraining element query and 
directory query
*  iterate over the tuples, incrementing the value of the id in a map
*  remove the range index on p
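
The steps above can be sketched as follows (element and attribute names per the
earlier example; assumes both range indexes exist):

(: Accumulate a per-article paragraph count from the range indexes alone,
   without loading any documents. :)
let $counts := map:map()
let $_ :=
  for $tuple in cts:value-tuples(
    (cts:element-attribute-reference(xs:QName("Article"), xs:QName("id")),
     cts:element-reference(xs:QName("p"))),
    "item-frequency",
    cts:directory-query("/Default/", "infinity"))
  let $id := fn:string(json:array-values($tuple)[1])
  return map:put($counts, $id,
    sum((map:get($counts, $id), cts:frequency($tuple))))
return $counts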

In MarkLogic 9, that approach gets simpler.  You can just use TDE
to project rows with columns for the id and element, group on 
the id column, and count the rows in the group.

Hoping that's useful (and salutations in passing),


Erik Hennum


From: general-boun...@developer.marklogic.com 
[general-boun...@developer.marklogic.com] on behalf of Geert Josten 
[geert.jos...@marklogic.com]
Sent: Tuesday, May 23, 2017 12:53 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get 
Statistics

Hi Eliot,

I'd consider using taskbot
(http://registry.demo.marklogic.com/package/taskbot), and using that in
combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It
will make optimal use of the TaskServer of the host on which you initiate
the call. It doesn't scale endlessly, but it batches up the work
automatically for you, and will get you a lot further fairly easily.

Cheers,
Geert

On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on behalf of
Eliot Kimber"  wrote:

>I haven't yet seen anything in the docs that directly addresses what I'm
>trying to do and suspect I'm simply missing some ML basics or just going
>about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but could be millions,
>of course), where each doc is an average of 200K and several thousand
>elements.
>
>I want to analyze the corpus to get details about the number of specific
>subelements within each document, e.g.:
>
>
>for $article in cts:search(/Article, cts:directory-query("/Default/",
>"infinity"))[$start to $end]
> return <article paras="{count($article//p)}"/>
>
>I'm running this as a query from Oxygen (so I can capture the results
>locally and do other stuff with them).
>
>On the server I'm using, I blow the expanded tree cache if I try to
>request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large
>set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database
>and then fetch them as the last thing, but I was hoping there was
>something simpler.
>
>Have I missed an obvious solution?
>
>Thanks,
>
>Eliot
>
>--
>Eliot Kimber
>http://contrext.com
>
>
>
>



Re: [MarkLogic Dev General] Priorities for queries

2017-05-23 Thread Oleksii Segeda
Hi Geert,

It makes sense. I guess the first query can just return a ticket number,
which can then be used to access the results.

Best,
Oleksii

From: general-boun...@developer.marklogic.com 
[mailto:general-boun...@developer.marklogic.com] On Behalf Of Geert Josten
Sent: Tuesday, May 23, 2017 3:25 AM
To: MarkLogic Developer Discussion 
Subject: Re: [MarkLogic Dev General] Priorities for queries

Hi Oleksii,

If you use xdmp:spawn or xdmp:spawn-function, you would be able to use the
<priority> option. It takes 'normal' and 'higher' as values. These priorities
have separate queues and worker threads, so they should interfere less with
each other.
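For illustration, passing the option might look like this (a sketch; the function body is just a placeholder):

```xquery
xquery version "1.0-ml";

(: Run script-originated work on the normal-priority queue so that
   interactive requests can use the 'higher' queue. :)
xdmp:spawn-function(
  function() { xdmp:log("long-running script work goes here") },
  <options xmlns="xdmp:eval">
    <priority>normal</priority>
  </options>)
```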

It might also be worth looking into a way to push low-priority work out to a
dedicated host for longer-running tasks. You could do that by writing such
queries to the database, and having a scheduled task on that particular host
monitor for them, pick them up one by one, and write back results once done.
It might be easiest to switch the script queries over to an asynchronous
process that polls regularly to see whether results have been written.
Does that make sense?

Cheers,
Geert

From: general-boun...@developer.marklogic.com on behalf of Oleksii Segeda
Reply-To: MarkLogic Developer Discussion
Date: Monday, May 22, 2017 at 8:59 PM
To: "general@developer.marklogic.com"
Subject: [MarkLogic Dev General] Priorities for queries

Hi,

Is there a way to give a lower priority to certain queries? We have two
different types of API consumers - real users and various scripts.
No matter how often the scripts hit endpoints or how "heavy" their queries
are, they should not affect API performance for real users.
In other words, scripts are tolerant of high latency, but users are not.

Regards,

Oleksii Segeda

IT Analyst

Information and Technology Solutions

www.worldbank.org



___
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general


