Re: [MarkLogic Dev General] Noob query question..

Eliot Kimber Thu, 24 Aug 2017 15:05:27 -0700

I just went through an exercise similar to this with my profiling application.


 

In my case I’m capturing the output from the profiler for a large number of 
processing instances (potentially millions). I’m measuring both raw performance 
and also at-scale performance for processing of a large corpus, so it’s not 
sufficient to just profile a few cases. We know there is wide variation in 
performance for different input documents and we also want to see trends, both 
within the data and over time as the data, code, and servers evolve. So I’m 
measuring everything.

 

I want to know which expressions take the most time across all the instances 
and get a count. For example, in the longest instances one particular 
expression is always the top one but is it the top one in the faster instances?

 

The information is in the profiler histogram output but it is not ordered by 
shallow time (the value I’m interested in), so it’s not as easy as just getting 
the first expression for each histogram.

 

The solution approach, developed for me by Evan Lenz, is to use a trick with 
co-occurrence queries where attributes on the same element have a proximity of 
zero. If you construct an index over two attributes you can then use 
cts:value-co-occurrences() to get all the pairs and then select the ones you 
want. This approach also requires that each set be in a separate document, so 
that you can limit each call to cts:value-co-occurrences() to a single document 
using cts:document-query(). If there were multiple profiling results in a 
single document there would be no index-based way to limit 
cts:value-co-occurrences() to a single profiling instance.

 

To enable this I had to post-process the output of the MarkLogic profiler to 
add attributes to the prof:expression elements with the shallow-time and 
expr-source values, which are otherwise within subelements, the result being:

 

<prof:expression shallow-time="PT0.000002S" 
expr-source="fn:exists($node/self::*)" expr-id="2746390622476927112" count="1">

 

(I used a simple XSLT transform for this part of the processing, applied as I 
store my profiling results.)
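
The same post-processing step can be sketched in XQuery rather than XSLT. This 
is only an illustration, not the transform described above: the function name 
local:add-attributes and the trimmed, hand-written report fragment are 
hypothetical, and the sketch assumes the values sit in prof:shallow-time and 
prof:expr-source child elements of each prof:expression.

```xquery
xquery version "1.0-ml";
declare namespace prof = "http://marklogic.com/xdmp/profile";

(: Hypothetical helper: copy a tree, adding shallow-time and expr-source
   attributes to every prof:expression from its child elements. :)
declare function local:add-attributes($node as node()) as node()
{
  typeswitch ($node)
    case document-node() return
      document { for $n in $node/node() return local:add-attributes($n) }
    case element(prof:expression) return
      element prof:expression {
        $node/@*,
        attribute shallow-time { fn:string($node/prof:shallow-time) },
        attribute expr-source { fn:string($node/prof:expr-source) },
        $node/node()
      }
    case element() return
      element { fn:node-name($node) } {
        $node/@*,
        for $n in $node/node() return local:add-attributes($n)
      }
    default return $node
};

(: Hand-written fragment for illustration; real reports carry more children. :)
let $report :=
  <prof:report>
    <prof:histogram>
      <prof:expression>
        <prof:expr-source>fn:exists($node/self::*)</prof:expr-source>
        <prof:shallow-time>PT0.000002S</prof:shallow-time>
      </prof:expression>
    </prof:histogram>
  </prof:report>
return local:add-attributes($report)
```

Keeping the original child elements alongside the new attributes means the 
stored document still reads like ordinary profiler output.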

 

I then defined attribute range indexes with word positions turned on for the 
@shallow-time and @expr-source attributes.
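
For anyone scripting this configuration, a sketch using the Admin API follows. 
It assumes the indexes belong on the database the query runs against, that 
shallow-time is indexed as dayTimeDuration and expr-source as a string in the 
root collation, and that the positions setting needed for proximity-limited 
co-occurrences is the range index's own "range value positions" flag (the last 
argument below). Those choices are assumptions, not details confirmed in this 
message.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

let $prof-ns := "http://marklogic.com/xdmp/profile"
let $config := admin:get-configuration()
(: Assumption: index the database this query is evaluated against. :)
let $db := xdmp:database()
let $indexes := (
  (: scalar type, parent namespace, parent name, attribute namespace,
     attribute name, collation, range value positions :)
  admin:database-range-element-attribute-index(
    "dayTimeDuration", $prof-ns, "expression", "", "shallow-time",
    "", fn:true()),
  admin:database-range-element-attribute-index(
    "string", $prof-ns, "expression", "", "expr-source",
    "http://marklogic.com/collation/", fn:true())
)
let $config :=
  admin:database-add-range-element-attribute-index($config, $db, $indexes)
return admin:save-configuration($config)
```

Adding the indexes triggers a reindex of existing content, so the lexicon calls 
only return complete results once that has finished.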

 

With that I could then do this to find the longest for each profiling instance 
(where each profiling instance is stored as a separate document):

 

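(: $collection names the collection that holds the profiling documents;
   it is not defined in this snippet, so supply it as an external variable. :)
declare variable $collection as xs:string external;

(: The index references are the same for every document, so build them once. :)
let $expression-index :=
    cts:element-attribute-reference(
        xs:QName("prof:expression"), xs:QName("expr-source"))
let $shallow-time-index :=
    cts:element-attribute-reference(
        xs:QName("prof:expression"), xs:QName("shallow-time"))
let $maps :=
    for $uri in cts:uris((), (), cts:collection-query($collection))
    (: Longest shallow time recorded in this profiling document. :)
    let $max := cts:max($shallow-time-index, (), cts:document-query($uri))
    (: Map from each expr-source value to the shallow-time values
       found on the same element (proximity of zero). :)
    let $co-occurrences :=
        cts:value-co-occurrences(
            $expression-index,
            $shallow-time-index,
            ("proximity=0", "map"),
            cts:document-query($uri)
        )
    for $key in map:keys($co-occurrences)
    (: Keep only the expression whose shallow time equals the maximum. :)
    where $max eq xs:dayTimeDuration(map:get($co-occurrences, $key)[1])
    return map:entry($key, map:get($co-occurrences, $key))
return $maps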
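
The per-document maps can then be rolled up into a count of how often each 
expression comes out on top, which is the cross-instance question raised at the 
start of this message. The following is a sketch only: local:tally and the 
sample maps are hypothetical, and each input map is assumed to use the 
expression source as its key, as in the query above.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: count how many maps contain each key and
   return the keys ordered by that count, highest first. :)
declare function local:tally($maps as map:map*) as element(expression)*
{
  let $tally := map:map()
  let $_ :=
    for $map in $maps
    for $key in map:keys($map)
    return map:put($tally, $key, (map:get($tally, $key), 0)[1] + 1)
  for $key in map:keys($tally)
  let $count := map:get($tally, $key)
  order by $count descending
  return <expression count="{$count}">{$key}</expression>
};

(: Made-up sample input standing in for the per-document results. :)
local:tally((
  map:entry("fn:exists($node/self::*)", xs:dayTimeDuration("PT0.000002S")),
  map:entry("fn:exists($node/self::*)", xs:dayTimeDuration("PT0.000004S")),
  map:entry("fn:count($nodes)", xs:dayTimeDuration("PT0.000003S"))
))
```

With the sample input, the first expression is reported with a count of 2 and 
the second with a count of 1.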

 

--

Eliot Kimber

http://contrext.com

 

From: <[email protected]> on behalf of "Ladner, Eric 
(Eric.Ladner)" <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Thursday, August 24, 2017 at 4:30 PM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Noob query question..

 

Thank you.  I will play with this in my development environment tomorrow.  I 
don’t quite see how it’s getting the counts per subject, though.

 

For reference.. the structure is similar to this:

 

<note>

   <subject>Test Subject</subject>

  <date_taken>2017-04-01T15:32:00</date_taken>

  <content>Blah, blah</content>

</note>

 

There would be many notes, obviously, and the output would ideally be something 
like the following (I'm not married to that format, but some output showing the 
counts for each subject over that time range).

 

<counts>

  <countItem>

     <subject>Test Subject</subject>

     <count>2</count>

  </countItem>

  <countItem>

    <subject>Subject 2</subject>

    <count>4</count>

  </countItem>

   ...

</counts>
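
One way to produce output of that shape from the indexes is a value lexicon 
call constrained by a range query, which plays the role of SQL's GROUP BY with 
a WHERE clause. This is a sketch under stated assumptions, not a tested 
solution: it assumes each note is stored as its own document, that a string 
element range index exists on subject, and that a dateTime element range index 
exists on date_taken.

```xquery
xquery version "1.0-ml";

(: Assumptions: one note per document, a string range index on subject,
   and a dateTime range index on date_taken. :)
let $cutoff := fn:current-dateTime() - xs:yearMonthDuration("P2Y")
(: Notes taken within the last two years. :)
let $recent :=
  cts:element-range-query(xs:QName("date_taken"), ">", $cutoff)
return
  <counts>{
    for $subject in
      cts:element-values(xs:QName("subject"), (), "frequency-order", $recent)
    return
      <countItem>
        <subject>{$subject}</subject>
        <count>{cts:frequency($subject)}</count>
      </countItem>
  }</counts>
```

cts:frequency() reports the number of matching fragments that contain each 
value, so with one note per document it is the per-subject note count. Further 
WHERE-style filters can be combined with the range query through 
cts:and-query().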

 

Eric Ladner

Systems Analyst

[email protected]

 

 

From: [email protected] 
[mailto:[email protected]] On Behalf Of Sam Mefford
Sent: August 24, 2017 15:59
To: MarkLogic Developer Discussion <[email protected]>
Subject: [**EXTERNAL**] Re: [MarkLogic Dev General] Noob query question..

 

I should point out that this is not the fastest way to do it.  A faster way 
would be to index "date-taken" as a dateTime element range index and use 
cts:search with cts:element-range-query.
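
A minimal sketch of that suggestion, assuming a dateTime element range index 
has already been configured on date-taken (the element name follows the earlier 
example in this thread) and that the values are stored as dateTimes:

```xquery
xquery version "1.0-ml";

(: Assumption: a dateTime element range index exists on date-taken. :)
let $cutoff := fn:current-dateTime() - xs:yearMonthDuration("P2Y")
(: Documents whose date-taken is more than two years in the past. :)
let $older :=
  cts:element-range-query(xs:QName("date-taken"), "<", $cutoff)
(: xdmp:estimate counts matches from the indexes without filtering. :)
return xdmp:estimate(cts:search(fn:doc(), $older))
```

Changing the operator to ">" selects documents newer than the cutoff instead.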

 

Sam Mefford
Senior Engineer
MarkLogic Corporation
[email protected]
Cell: +1 801 706 9731
www.marklogic.com

This e-mail and any accompanying attachments are confidential. The information 
is intended solely for the use of the individual to whom it is addressed. Any 
review, disclosure, copying, distribution, or use of this e-mail communication 
by others is strictly prohibited. If you are not the intended recipient, please 
notify us immediately by returning this message to the sender and delete all 
copies. Thank you for your cooperation.

From: [email protected] 
[[email protected]] on behalf of Sam Mefford 
[[email protected]]
Sent: Thursday, August 24, 2017 2:56 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Noob query question..

XQuery is an extension of XPath.  Here's an example in XPath.  These things are 
easiest to understand if we know the structure of your docs.  Suppose I insert: 

 

xdmp:document-insert("test.xml", 
<note><date-taken>2015-01-01</date-taken></note>)

 

I could find the count of docs more than two years old like this:

 

count(/note[fn:days-from-duration(fn:current-date() - xs:date(date-taken)) > (365 * 2)])

 

 

Sam Mefford
Senior Engineer
MarkLogic Corporation
[email protected]
Cell: +1 801 706 9731
www.marklogic.com


From: [email protected] 
[[email protected]] on behalf of Ladner, Eric 
(Eric.Ladner) [[email protected]]
Sent: Thursday, August 24, 2017 2:24 PM
To: MarkLogic Developer Discussion
Subject: [MarkLogic Dev General] Noob query question..

I’m still rather new to MarkLogic and apparently have a lot to learn.  When 
doing research on a proof of concept, I ran across a situation that would be 
trivial to solve in SQL, but I’m having problems wrapping my head around how to 
do that in XQuery.  Or, is XQuery even the right place for this?

 

Basically, I need the number of notes per subject, counting only notes that are 
less than two years old.  If I were to do this in SQL, it’d look something like:

 

   select subject, count(*) from notes where date_taken > sysdate-(365*2) group by subject;

 

There’s some additional WHERE clause stuff for filtering, but on average, the 
number of results shouldn’t be large.  

 

Any guidance on building up more complex queries like this?  The documentation 
is semi-helpful, but the examples it gives for usage are usually very 
simplistic.

 

Eric Ladner

Systems Analyst

[email protected]

_______________________________________________
General mailing list
[email protected]
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general
