RE: [MarkLogic Dev General] xdmp:estimate..

Geert Josten Tue, 14 Jul 2009 22:57:30 -0700

Thanks, that does the trick!
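
For the archive, here is a minimal sketch of how the progress counters from earlier in this thread might look with the fix applied: xdmp:estimate() instead of count(), and the '/text()' steps dropped from the comparisons. The variable names follow the quoted code below. Whether the not() clause is resolved purely from the indexes is not established in this thread, so comparing the results against count() on a small database first seems prudent.

```xquery
xquery version "1.0-ml";

(: Sketch only: binds the cpf prefix the same way as the quoted example. :)
import module namespace cpf = "http://marklogic.com/cpf"
  at "/MarkLogic/cpf/cpf.xqy";

(: All property fragments that carry a processing status. :)
let $total-count := xdmp:estimate(
  xdmp:document-properties()/prop:properties/cpf:processing-status )

(: Compare the elements directly; no '/text()' step. :)
let $error-count := xdmp:estimate(
  xdmp:document-properties()/prop:properties[
    cpf:state = 'http://marklogic.com/states/error'] )

(: Assumption: not() behaves the same under estimate as under count(). :)
let $done-count := xdmp:estimate(
  xdmp:document-properties()/prop:properties[
    cpf:processing-status = 'done'
    and not(cpf:state = 'http://marklogic.com/states/error')] )

let $active-count := $total-count - $error-count - $done-count

return (
  concat("Total: ", $total-count),
  concat("Done: ", $done-count),
  concat("Active: ", $active-count),
  concat("Error: ", $error-count)
)
```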


> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Michael Blakeley
> Sent: Tuesday 14 July 2009 22:44
> To: General Mark Logic Developer Discussion
> Subject: Re: [MarkLogic Dev General] xdmp:estimate..
> 
> Geert,
> 
> Try removing the '/text()' step. It isn't necessary, and 
> seems to confuse the evaluator in this case.
> 
> import module namespace cpf = "http://marklogic.com/cpf"
>   at "/MarkLogic/cpf/cpf.xqy";
> 
> (: estimate with the '/text()' step :)
> xdmp:estimate(
>   xdmp:document-properties()/prop:properties[
>     cpf:state/text() = 'http://marklogic.com/states/error'] ),
> (: estimate without the '/text()' step :)
> xdmp:estimate(
>   xdmp:document-properties()/prop:properties[
>     cpf:state = 'http://marklogic.com/states/error'] ),
> (: exact count, for comparison :)
> count(
>   xdmp:document-properties()/prop:properties[
>     cpf:state = 'http://marklogic.com/states/error'] )
> 
> => 2 0 0
> 
> I believe that '2' is the count of property fragments that 
> have prop:properties and cpf:state elements, ignoring the 
> value of cpf:state.
> 
> -- Mike
> 
> On 2009-07-14 13:38, Geert Josten wrote:
> > Mike,
> >
> > No I hadn't, actually. But now I have. :-)
> >
> > I have only a few documents in the database, but even for this
> > small set the profiled time drops from 0.2 sec to 0.02 sec.
> > Unfortunately, I was expecting the following numbers:
> >
> > Total: 1802
> > Done: 1802 (100%)
> > Active: 0 (0%)
> > Error: 0 (0%)
> >
> > But am now getting:
> >
> > Total: 1802
> > Done: 1802 (100%)
> > Active: -1775 (-99%)
> > Error: 1775 (99%)
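
The negative Active figure is what the subtraction in the original progress-bar code (quoted further down) produces once the error estimate is inflated. A small worked sketch with the numbers above; the shortened variable names are for illustration only:

```xquery
let $total := 1802
let $done := 1802
let $error := 1775   (: the inflated estimate :)
let $active := $total - $error - $done
return ($active, round(100 * $active div $total))
(: yields -1775 and -99, matching the "Active" line above :)
```
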
> >
> > Apparently, the estimate for error docs differs from the real count:
> >
> > let $error-count := xdmp:estimate(
> >   xdmp:document-properties()/prop:properties[
> >     cpf:state/text() = 'http://marklogic.com/states/error'] )
> >
> > I am guessing it is some sophisticated 'feature' of xdmp:estimate
> > being fragment based, but I have trouble figuring things out.
> >
> > Some database statistics:
> > Docs: 1,802
> > Fragments: 3,619
> > Deleted: 370
> > Stands: 2
> >
> > A merge didn't make any difference, other than clearing deleted
> > fragments.
> >
> > Any ideas? Anyone?
> >
> > Kind regards,
> > Geert
> >
> >> -----Original Message-----
> >> From: [email protected]
> >> [mailto:[email protected]] On Behalf Of Michael Blakeley
> >> Sent: Tuesday 14 July 2009 17:39
> >> To: General Mark Logic Developer Discussion
> >> Subject: Re: [MarkLogic Dev General] triggering after spawning
> >>
> >> Geert,
> >>
> >> Have you tried xdmp:estimate() instead of count()? The difference is
> >> that count() generally drives I/O, while xdmp:estimate() does not. For
> >> this purpose, I believe that both will return the same results using
> >> the default indexes. I don't think any special indexes are needed.
> >>
> >> thanks,
> >> -- Mike
> >>
> >> On 2009-07-14 07:55, Geert Josten wrote:
> >>> Hi Jakob,
> >>>
> >>> I am, quite crudely, doing things like this:
> >>>
> >>> let $total-count := count(
> >>>   xdmp:document-properties()/prop:properties/cpf:processing-status )
> >>>
> >>> let $done-count := count(
> >>>   xdmp:document-properties()/prop:properties[
> >>>     cpf:processing-status/text() = 'done'
> >>>     and not(cpf:state/text() = 'http://marklogic.com/states/error')] )
> >>>
> >>> let $error-count := count(
> >>>   xdmp:document-properties()/prop:properties[
> >>>     cpf:state/text() = 'http://marklogic.com/states/error'] )
> >>>
> >>> let $active-count := $total-count - $error-count - $done-count
> >>>
> >>> No looping, just xpath with predicates wrapped in a count.
> >>> No special indexes (yet).
> >>> Kind regards,
> >>> Geert
> >>>
> >>>> -----Original Message-----
> >>>> From: [email protected]
> >>>> [mailto:[email protected]] On Behalf Of Jakob Fix
> >>>> Sent: Tuesday 14 July 2009 16:44
> >>>> To: General Mark Logic Developer Discussion
> >>>> Subject: Re: [MarkLogic Dev General] triggering after spawning
> >>>>
> >>>> Geert,
> >>>>
> >>>> Good question about storing this info at all. Doing a normal xpath
> >>>> clearly takes too long (five seconds or so), so yes, you're right, I
> >>>> will test the index on the attribute value.
> >>>>
> >>>> cheers,
> >>>> Jakob.
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Jul 14, 2009 at 16:36, Geert Josten<[email protected]> wrote:
> >>>>> I am wondering why you store it in the database at all. Why not
> >>>>> calculate it on demand? Putting an index on the boolean element
> >>>>> should allow it to perform even when you have processed many, many
> >>>>> documents.
> >>>>>
> >>>>> You might even try doing it without adding a particular index. It
> >>>>> might be covered by the word index already.
> >>>>>
> >>>>> I did a similar thing to keep track of all documents being processed
> >>>>> by CPF, using counts on all documents with specific property values
> >>>>> to show a progress bar. I haven't tried it with many documents yet,
> >>>>> but just showing the progress bar, based on about 4 counts, takes
> >>>>> only a few tenths of a second. It didn't need any special indexes
> >>>>> at all.
> >>>>>
> >>>>> Kind regards,
> >>>>> Geert
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: [email protected]
> >>>>>> [mailto:[email protected]] On Behalf Of Jakob Fix
> >>>>>> Sent: Tuesday 14 July 2009 16:27
> >>>>>> To: General Mark Logic Developer Discussion
> >>>>>> Subject: Re: [MarkLogic Dev General] triggering after spawning
> >>>>>>
> >>>>>> Geert,
> >>>>>>
> >>>>>> thanks for the quick reply. Some more information which explains
> >>>>>> the logic behind what I'm doing:
> >>>>>>
> >>>>>> Each day I get an input document containing a(n increasing) number
> >>>>>> of URLs (currently around 23,000) which return XML documents,
> >>>>>> containing among other things a boolean value. Each day, I record
> >>>>>> the total number of documents actually retrieved, the number of
> >>>>>> "true" and the number of "false" (the total number being a kind of
> >>>>>> checksum).
> >>>>>>
> >>>>>> The summary document looks a bit like this:
> >>>>>>
> >>>>>> <doi-stats>
> >>>>>> ...
> >>>>>>     <doi-stat date="2009-07-14"
> >>>>>>         recorded="{fn:current-dateTime()}" resolved="123"
> >>>>>>         unresolved="456" total="579" />   ...
> >>>>>> </doi-stats>
> >>>>>>
> >>>>>> Now, you're right it might be possible for each spawned task to
> >>>>>> update this document, however, wouldn't there be a serious
> >>>>>> performance impact?
> >>>>>>
> >>>>>> First, I would have to decrease the number of concurrent tasks
> >>>>>> (currently six) to maybe two (or even one?), so that there's not
> >>>>>> too much time spent waiting to update the document. Second, for
> >>>>>> each document I would need to count all documents in the
> >>>>>> collection (or the directory), and third, I'd need to do the two
> >>>>>> xpaths to retrieve the booleans ...
> >>>>>>
> >>>>>> The more I think about this approach, the less I'm convinced that
> >>>>>> it's scalable, but I'd be more than happy to be convinced
> >>>>>> otherwise!
> >>>>>>
> >>>>>> thanks,
> >>>>>> Jakob.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Jul 14, 2009 at 16:02, Geert Josten<[email protected]> wrote:
> >>>>>>> Or just have each task update the summary document, each
> >>>>>>> incrementing the finished docs counter by one (if there is any)?
> >>>>>>> Note: that effectively serializes all tasks.
> >>>>>>>
> >>>>>>> Kind regards,
> >>>>>>> Geert
> >>>>>>>
> >>>>>>>
> >>>>>>> Drs. G.P.H. Josten
> >>>>>>> Consultant
> >>>>>>>
> >>>>>>>
> >>>>>>> http://www.daidalos.nl/
> >>>>>>> Daidalos BV
> >>>>>>> Source of Innovation
> >>>>>>> Hoekeindsehof 1-4
> >>>>>>> 2665 JZ Bleiswijk
> >>>>>>> Tel.: +31 (0) 10 850 1200
> >>>>>>> Fax: +31 (0) 10 850 1199
> >>>>>>> http://www.daidalos.nl/
> >>>>>>> KvK 27164984
> >>>>>>> The information sent in or with this email message originates
> >>>>>>> from Daidalos BV and is intended solely for the addressee. If you
> >>>>>>> have received this message unintentionally, we ask that you delete
> >>>>>>> it. No rights can be derived from this message.
> >>>>>>>> From: [email protected]
> >>>>>>>> [mailto:[email protected]] On Behalf Of Jakob Fix
> >>>>>>>> Sent: Tuesday 14 July 2009 15:55
> >>>>>>>> To: General Mark Logic Developer Discussion
> >>>>>>>> Subject: [MarkLogic Dev General] triggering after spawning
> >>>>>>>>
> >>>>>>>> So I manage to spawn some twenty thousand tasks to retrieve
> >>>>>>>> documents from a remote server and to store them in MarkLogic.
> >>>>>>>> I've also created a user interface with a progress bar to follow
> >>>>>>>> its progress (although this won't be used in production).
> >>>>>>>>
> >>>>>>>> Now, what I'd like to do is to trigger an update of a summary
> >>>>>>>> document once all spawned tasks have executed. From my limited
> >>>>>>>> experience with ML, I cannot seem to find a satisfying solution
> >>>>>>>> to this challenge ...
> >>>>>>>>
> >>>>>>>> My ideas:
> >>>>>>>> - After the spawn, call a function recursively which sleeps for
> >>>>>>>>   some time and checks the number of tasks in the task queue, and
> >>>>>>>>   once it's empty assumes "that that's that" and updates/creates
> >>>>>>>>   a document?
> >>>>>>>> - Have each spawned task inspect the task queue and, if there is
> >>>>>>>>   just one task in the queue (i.e. itself), trigger the document
> >>>>>>>>   update?
> >>>>>>>>
> >>>>>>>> Hmmm, any better ideas?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Jakob.
> >>>>>>>> _______________________________________________
> >>>>>>>> General mailing list
> >>>>>>>> [email protected] 
> >>>>>>>> http://xqzone.com/mailman/listinfo/general