RE: [MarkLogic Dev General] xdmp:estimate.. (was: triggering after spawning)

Geert Josten Tue, 14 Jul 2009 13:38:54 -0700

Mike,

No I hadn't, actually. But now I have. :-)


I have only a few documents in the database, but even for this small set the
profiled time drops from 0.2 sec to 0.02 sec. Unfortunately, the numbers are
now off. I was expecting the following:

Total: 1802
Done: 1802 (100%)
Active: 0 (0%)
Error: 0 (0%)

But am now getting:

Total: 1802
Done: 1802 (100%)
Active: -1775 (-99%)
Error: 1775 (99%)

Apparently, the estimate for error docs differs from the real count:

let $error-count := xdmp:estimate( 
xdmp:document-properties()/prop:properties[cpf:state/text() = 
'http://marklogic.com/states/error'] )

I am guessing it is some sophisticated 'feature' of xdmp:estimate being 
fragment-based, but I am having trouble figuring it out.
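
One way to narrow this down is to run the estimate and the real count side by
side for the same path expression. The following is only a minimal sketch: it
assumes the usual cpf and prop namespace URIs and the 1.0-ml dialect, and the
element names in the result are made up for illustration.

```xquery
xquery version "1.0-ml";

(: Assumed namespace bindings; adjust if your setup differs. :)
declare namespace cpf  = "http://marklogic.com/cpf";
declare namespace prop = "http://marklogic.com/xdmp/property";

(: Answer resolved from the indexes only. :)
let $estimated := xdmp:estimate(
  xdmp:document-properties()/prop:properties[cpf:state/text() =
    'http://marklogic.com/states/error'] )

(: Answer obtained by evaluating the predicate against the actual properties. :)
let $counted := count(
  xdmp:document-properties()/prop:properties[cpf:state/text() =
    'http://marklogic.com/states/error'] )

(: Hypothetical result wrapper, only there to show both numbers together. :)
return
  <error-counts>
    <estimated>{ $estimated }</estimated>
    <counted>{ $counted }</counted>
    <difference>{ $estimated - $counted }</difference>
  </error-counts>
```

A non-zero difference would suggest that the indexes select more fragments
than actually satisfy the predicate, which would fit the fragment-based guess.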

Some database statistics:
Docs: 1,802
Fragments: 3,619
Deleted: 370
Stands: 2

A merge didn't make any difference, other than clearing the deleted fragments.

Any ideas? Anyone?

Kind regards,
Geert

> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of 
> Michael Blakeley
> Sent: Tuesday, 14 July 2009 17:39
> To: General Mark Logic Developer Discussion
> Subject: Re: [MarkLogic Dev General] triggering after spawning
> 
> Geert,
> 
> Have you tried xdmp:estimate() instead of count()? The 
> difference is that count() generally drives I/O, while 
> xdmp:estimate() does not. For this purpose, I believe that 
> both will return the same results using the default indexes. 
> I don't think any special indexes are needed.
> 
> thanks,
> -- Mike
> 
> On 2009-07-14 07:55, Geert Josten wrote:
> > Hi Jakob,
> >
> > I am, rather crudely, doing things like this:
> >
> > let $total-count := count(
> >   xdmp:document-properties()/prop:properties/cpf:processing-status )
> >
> > let $done-count := count(
> >   xdmp:document-properties()/prop:properties[
> >     cpf:processing-status/text() = 'done'
> >     and not(cpf:state/text() = 'http://marklogic.com/states/error')] )
> >
> > let $error-count := count(
> >   xdmp:document-properties()/prop:properties[
> >     cpf:state/text() = 'http://marklogic.com/states/error'] )
> >
> > let $active-count := $total-count - $error-count - $done-count
> >
> > No looping, just XPath with predicates wrapped in a count. No special
> > indexes (yet).
> >
> > Kind regards,
> > Geert
> >
> >> -----Original Message-----
> >> From: [email protected]
> >> [mailto:[email protected]] On Behalf Of Jakob Fix
> >> Sent: Tuesday, 14 July 2009 16:44
> >> To: General Mark Logic Developer Discussion
> >> Subject: Re: [MarkLogic Dev General] triggering after spawning
> >>
> >> Geert,
> >>
> >> Good question about storing this info at all.  Doing a normal XPath
> >> clearly takes too long (five seconds or so), so yes, you're right, I
> >> will test the index on the attribute value.
> >>
> >> cheers,
> >> Jakob.
> >>
> >>
> >>
> >> On Tue, Jul 14, 2009 at 16:36, Geert Josten<[email protected]> wrote:
> >>> I am wondering why you store it in the database at all. Why not
> >>> calculate it on demand? Putting an index on the boolean element
> >>> should allow it to perform even when you have processed a great
> >>> many documents.
> >>>
> >>> You might even try doing it without adding a particular index. It
> >>> might be covered by the word index already.
> >>>
> >>> I did a similar thing to keep track of all documents being
> >>> processed by CPF, using counts on all documents with specific
> >>> property values to show a progress bar. I haven't tried it with many
> >>> documents yet, but showing the progress bar, based on about 4
> >>> counts, takes only a few tenths of a second. I didn't need any
> >>> special indexes at all.
> >>>
> >>> Kind regards,
> >>> Geert
> >>>
> >>>> -----Original Message-----
> >>>> From: [email protected]
> >>>> [mailto:[email protected]] On Behalf Of Jakob Fix
> >>>> Sent: Tuesday, 14 July 2009 16:27
> >>>> To: General Mark Logic Developer Discussion
> >>>> Subject: Re: [MarkLogic Dev General] triggering after spawning
> >>>>
> >>>> Geert,
> >>>>
> >>>> thanks for the quick reply. Some more information which explains the
> >>>> logic behind what I'm doing:
> >>>>
> >>>> Each day I get an input document containing a(n increasing) number of
> >>>> URLs (currently around 23,000) which return XML documents containing,
> >>>> among other things, a boolean value.
> >>>> Each day, I record the total number of documents actually retrieved,
> >>>> the number of "true" and the number of "false"
> >>>> (the total number being a kind of checksum).
> >>>>
> >>>> The summary document looks a bit like this:
> >>>>
> >>>> <doi-stats>
> >>>> ...
> >>>>    <doi-stat date="2009-07-14"
> >>>>        recorded="{fn:current-dateTime()}" resolved="123"
> >>>>        unresolved="456" total="579" />  ...
> >>>> </doi-stats>
> >>>>
> >>>> Now, you're right it might be possible for each spawned task to
> >>>> update this document; however, wouldn't there be a serious
> >>>> performance impact?
> >>>>
> >>>> First, I would have to decrease the number of concurrent tasks
> >>>> (currently six) to maybe two (or even one?), so that there's not too
> >>>> much time spent waiting to update the document.  Second, for each
> >>>> document I would need to count all documents in the collection (or
> >>>> the directory), and third, I'd need to do the two XPaths to retrieve
> >>>> the booleans ...
> >>>>
> >>>> The more I think about this approach, the less I'm convinced that
> >>>> it's scalable, but I'd be more than happy to be convinced otherwise!
> >>>>
> >>>> thanks,
> >>>> Jakob.
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Jul 14, 2009 at 16:02, Geert Josten<[email protected]> wrote:
> >>>>> Or just have each task update the summary document, each
> >>>>> incrementing the finished docs counter by one (if there is any)?
> >>>>> Note that this effectively serializes all tasks.
> >>>>>
> >>>>> Kind regards,
> >>>>> Geert
> >>>>>
> >>>>>
> >>>>> Drs. G.P.H. Josten
> >>>>> Consultant
> >>>>>
> >>>>>
> >>>>> http://www.daidalos.nl/
> >>>>> Daidalos BV
> >>>>> Source of Innovation
> >>>>> Hoekeindsehof 1-4
> >>>>> 2665 JZ Bleiswijk
> >>>>> Tel.: +31 (0) 10 850 1200
> >>>>> Fax: +31 (0) 10 850 1199
> >>>>> http://www.daidalos.nl/
> >>>>> KvK 27164984
> >>>>>
> >>>>>> From: [email protected]
> >>>>>> [mailto:[email protected]] On Behalf Of Jakob Fix
> >>>>>> Sent: Tuesday, 14 July 2009 15:55
> >>>>>> To: General Mark Logic Developer Discussion
> >>>>>> Subject: [MarkLogic Dev General] triggering after spawning
> >>>>>>
> >>>>>> So I manage to spawn some twenty thousand tasks to retrieve documents
> >>>>>> from a remote server and to store them in MarkLogic.  I've also
> >>>>>> created a user interface with a progress bar to follow its progress
> >>>>>> (although this won't be used in production).
> >>>>>>
> >>>>>> Now, what I'd like to do is to trigger an update of a summary
> >>>>>> document once all spawned tasks have executed. From my limited
> >>>>>> experience with ML, I cannot seem to find a satisfying solution to
> >>>>>> this challenge ...
> >>>>>>
> >>>>>> My ideas:
> >>>>>> - After the spawn, call a function recursively which sleeps for some
> >>>>>>   time and checks the number of tasks in the task queue, and once it's
> >>>>>>   empty assumes "that that's that" and updates/creates a document?
> >>>>>> - Have each spawned task inspect the task queue and, if there is just
> >>>>>>   one task in the queue (i.e. itself), trigger the document update?
> >>>>>>
> >>>>>> Hmmm, any better ideas?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Jakob.
> >>>>>> _______________________________________________
> >>>>>> General mailing list
> >>>>>> [email protected]
> >>>>>> http://xqzone.com/mailman/listinfo/general