Re: [MarkLogic Dev General] xdmp:estimate..

Michael Blakeley Tue, 14 Jul 2009 13:44:15 -0700

Geert,

Try removing the '/text()' step. It isn't necessary, and seems to confuse the evaluator in this case.

import module namespace cpf = "http://marklogic.com/cpf" at "/MarkLogic/cpf/cpf.xqy";


xdmp:estimate(
  xdmp:document-properties()/prop:properties[
    cpf:state/text() = 'http://marklogic.com/states/error'] )
,
xdmp:estimate(
  xdmp:document-properties()/prop:properties[
    cpf:state = 'http://marklogic.com/states/error'] )
,
count(
  xdmp:document-properties()/prop:properties[
    cpf:state = 'http://marklogic.com/states/error'] )

=> 2 0 0

I believe that '2' is the count of property fragments that have prop:properties and cpf:state elements, ignoring the value of cpf:state.
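
A quick way to check whether an estimate can be trusted for a given path is to run it next to count() on a small database and compare the two. The sketch below does only that, reusing the expressions above; the <check> wrapper element and the variable names are illustrative and not part of any MarkLogic API.

```xquery
xquery version "1.0-ml";

(: Namespaces only; no CPF functions are called here. :)
declare namespace cpf  = "http://marklogic.com/cpf";
declare namespace prop = "http://marklogic.com/xdmp/property";

(: Index-only answer: counts candidate fragments without reading them. :)
let $estimate := xdmp:estimate(
  xdmp:document-properties()/prop:properties[
    cpf:state = 'http://marklogic.com/states/error'] )

(: Filtered answer: reads the fragments, so it is exact but drives I/O. :)
let $count := count(
  xdmp:document-properties()/prop:properties[
    cpf:state = 'http://marklogic.com/states/error'] )

return
  <check estimate="{ $estimate }" count="{ $count }"
         agree="{ $estimate eq $count }"/>
```

If agree comes back false, the predicate is presumably not being resolved fully from the indexes, and the estimate should not be used as an exact count for that expression.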


-- Mike

On 2009-07-14 13:38, Geert Josten wrote:

Mike,

No I hadn't, actually. But now I have. :-)

I have only a few documents in the database, but even for this small
set the profile timing drops from 0.2 sec to 0.02 sec. Unfortunately,
I was expecting the following numbers:

Total: 1802
Done: 1802 (100%)
Active: 0 (0%)
Error: 0 (0%)

But am now getting:

Total: 1802
Done: 1802 (100%)
Active: -1775 (-99%)
Error: 1775 (99%)

Apparently, the estimate for error docs differs from the real count:

let $error-count := xdmp:estimate( 
xdmp:document-properties()/prop:properties[cpf:state/text() = 
'http://marklogic.com/states/error'] )

I am guessing it is some sophisticated 'feature' of xdmp:estimate being 
fragment based, but have trouble figuring things out.

Some database statistics:
Docs: 1,802
Fragments: 3,619
Deleted: 370
Stands: 2

A merge didn't make any difference, other than clearing the deleted fragments.

Any ideas? Anyone?

Kind regards,
Geert

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of
Michael Blakeley
Sent: Tuesday, 14 July 2009 17:39
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] triggering after spawning

Geert,

Have you tried xdmp:estimate() instead of count()? The
difference is that count() generally drives I/O, while
xdmp:estimate() does not. For this purpose, I believe that
both will return the same results using the default indexes.
I don't think any special indexes are needed.

thanks,
-- Mike

On 2009-07-14 07:55, Geert Josten wrote:

Hi Jakob,

I am, quite crudely, doing things like this:

let $total-count := count(
  xdmp:document-properties()/prop:properties/cpf:processing-status )

let $done-count := count(
  xdmp:document-properties()/prop:properties[
    cpf:processing-status/text() = 'done'
    and not(cpf:state/text() = 'http://marklogic.com/states/error')] )

let $error-count := count(
  xdmp:document-properties()/prop:properties[
    cpf:state/text() = 'http://marklogic.com/states/error'] )

let $active-count := $total-count - $error-count - $done-count

No looping, just xpath with predicates wrapped in a count.

No special indexes (yet)..

Kind regards,
Geert

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Jakob Fix
Sent: Tuesday, 14 July 2009 16:44
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] triggering after spawning

Geert,

Good question about storing this info at all.  Doing a normal xpath
clearly takes too long (five seconds or so), so yes, you're right, I
will test the index on the attribute value.

cheers,
Jakob.



On Tue, Jul 14, 2009 at 16:36, Geert Josten <[email protected]> wrote:

I am wondering why you would store it in the database at all. Why not
calculate it on demand? Putting an index on the boolean element should
allow it to perform well even when you have processed a great many
documents.

You might even try doing it without adding a particular index. It
might be covered by the word index already.

I did a similar thing to keep track of all documents being processed
by CPF, using counts on all documents with specific property values to
show a progress bar. I haven't tried it with many documents yet, but
showing the progress bar based on about 4 counts takes only a few
tenths of a second. It didn't need any special indexes at all.

Kind regards,
Geert

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Jakob Fix
Sent: Tuesday, 14 July 2009 16:27
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] triggering after spawning

Geert,

thanks for the quick reply. Some more information which explains the
logic behind what I'm doing:

Each day I get an input document containing a(n increasing) number of
URLs (currently around 23,000) which return XML documents, containing
among other things a boolean value.
Each day, I record the total number of documents actually retrieved,
the number of "true" and the number of "false"
(the total number being a kind of checksum).

The summary document looks a bit like this:

<doi-stats>
...
    <doi-stat date="2009-07-14"
        recorded="{fn:current-dateTime()}" resolved="123"
        unresolved="456" total="579" />   ...
</doi-stats>

Now, you're right it might be possible for each spawned task to
update this document, however, wouldn't there be a serious
performance impact?

First, I would have to decrease the number of concurrent tasks
(currently six) to maybe two (or even one?), so that there's not too
much time spent waiting to update the document.  Second, for each
document I would need to count all documents in the collection (or
the directory), and third, I'd need to do the two xpaths to retrieve
the booleans ...

The more I think about this approach, the less I'm convinced that
it's scalable, but I'd be more than happy to be convinced otherwise!

thanks,
Jakob.



On Tue, Jul 14, 2009 at 16:02, Geert Josten <[email protected]> wrote:

Or just have each task update the summary document, each incrementing
the finished-docs counter by one (if there is any)?

Note: that effectively serializes all tasks.
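
A minimal sketch of what such a per-task update might look like, assuming the summary has the doi-stats structure Jakob describes and already contains a doi-stat element for the day in question. The URI /stats/doi-stats.xml and the external variable $date are hypothetical; the thread does not say where the summary is stored or how tasks are parameterized.

```xquery
xquery version "1.0-ml";

(: Hypothetical: the date key is passed in by whoever spawns the task,
   e.g. "2009-07-14". :)
declare variable $date as xs:string external;

(: Hypothetical URI for the summary document. :)
let $stat := fn:doc("/stats/doi-stats.xml")/doi-stats/doi-stat[@date = $date]
let $old  := $stat/@total
return
  if (fn:empty($old)) then ()
  else
    (: Replace the attribute with a copy whose value is one higher. :)
    xdmp:node-replace($old, attribute total { xs:integer($old) + 1 })
```

Every task that runs this has to lock the same document to update it, so concurrent tasks end up waiting on each other, which is the serialization noted above.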

Kind regards,
Geert


Drs. G.P.H. Josten
Consultant


http://www.daidalos.nl/
Daidalos BV
Source of Innovation
Hoekeindsehof 1-4
2665 JZ Bleiswijk
Tel.: +31 (0) 10 850 1200
Fax: +31 (0) 10 850 1199
http://www.daidalos.nl/
KvK 27164984

From: [email protected]
[mailto:[email protected]] On Behalf Of Jakob Fix
Sent: Tuesday, 14 July 2009 15:55
To: General Mark Logic Developer Discussion
Subject: [MarkLogic Dev General] triggering after spawning

So I managed to spawn some twenty thousand tasks to retrieve documents
from a remote server and to store them in MarkLogic.  I've also
created a user interface with a progress bar to follow its progress
(although this won't be used in production).

Now, what I'd like to do is to trigger an update of a summary
document once all spawned tasks have executed. From my limited
experience with ML, I cannot seem to find a satisfying solution to
this challenge ...

My ideas:
- After the spawn, call a function recursively which sleeps for some
  time and checks the number of tasks in the task queue, and once it's
  empty assumes "that that's that" and updates/creates a document?
- Have each spawned task inspect the task queue and if there is just
  one task in the queue (i.e. itself), trigger the document update?

Hmmm, any better ideas?

Thanks,
Jakob.
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
