I just went through an exercise similar to this with my profiling application.
In my case I’m capturing the output from the profiler for a large number of
processing instances (potentially millions). I’m measuring both raw performance
and also at-scale performance for processing of a large corpus, so it’s not
sufficient to just profile a few cases. We know there is wide variation in
performance for different input documents and we also want to see trends, both
within the data and over time as the data, code, and servers evolve. So I’m
measuring everything.
I want to know which expressions take the most time across all the instances
and get a count. For example, in the longest instances one particular
expression is always the top one, but is it also the top one in the faster
instances?
The information is in the profiler histogram output but it is not ordered by
shallow time (the value I’m interested in), so it’s not as easy as just getting
the first expression for each histogram.
The solution approach, developed for me by Evan Lenz, is to use a trick with
co-occurrence queries where attributes on the same element have a proximity of
zero. If you construct an index over two attributes you can then use
cts:value-co-occurrences() to get all the pairs and then select the ones you
want. This approach also requires that each set be in a separate document (so
that you can limit each call to cts:value-co-occurrences() to a single document
using cts:document-query()). If there were multiple profiling results in a
single document there would be no index-based way to limit
cts:value-co-occurrences() to a single profiling instance.
To enable this I had to post-process the output of the MarkLogic profiler to add
attributes to the prof:expression elements with the shallow-time and
expr-source values, which are otherwise within subelements, the result being:
<prof:expression shallow-time="PT0.000002S"
expr-source="fn:exists($node/self::*)" expr-id="2746390622476927112" count="1">
(I used a simple XSLT transform for this part of the processing, applied as I
store my profiling results.)
I then defined attribute range indexes with word positions turned on for the
@shallow-time and @expr-source attributes.
With that I could then do this to find the longest for each profiling instance
(where each profiling instance is stored as a separate document):
declare namespace prof = "http://marklogic.com/xdmp/profile";

(: $collection is assumed to be bound to the collection of profiling documents :)
let $maps :=
  for $uri in cts:uris((), (), cts:collection-query($collection))
  let $expression-index :=
    cts:element-attribute-reference(xs:QName("prof:expression"), xs:QName("expr-source"))
  let $shallow-time-index :=
    cts:element-attribute-reference(xs:QName("prof:expression"), xs:QName("shallow-time"))
  (: longest shallow time recorded in this document :)
  let $max := cts:max($shallow-time-index, (), cts:document-query($uri))
  (: expr-source/shallow-time pairs that sit on the same element :)
  let $co-occurrences :=
    cts:value-co-occurrences(
      $expression-index,
      $shallow-time-index,
      ("proximity=0", "map"),
      cts:document-query($uri)
    )
  (: keep only the expression(s) whose shallow time equals the maximum :)
  let $max-co-occurrence :=
    for $map in $co-occurrences
    let $keys := map:keys($map)
    for $key in $keys
    return
      if ($max eq xs:dayTimeDuration(map:get($map, $key)[1]))
      then map:entry($key, map:get($map, $key))
      else ()
  return $max-co-occurrence
return $maps
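
To get from the per-document results to the count across all instances, the maps could be tallied by expression. What follows is only a sketch of one way to do that, not part of the original solution: local:tally is a hypothetical helper, and the sample maps are made up to stand in for the real per-document results.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: count how often each expression appears as a key
   across a sequence of maps (one map entry per top expression). :)
declare function local:tally($maps as map:map*) as xs:string*
{
  let $tally := map:map()
  let $_ :=
    for $m in $maps, $expr in map:keys($m)
    (: start from 0 when the expression has not been seen yet :)
    return map:put($tally, $expr, (map:get($tally, $expr), 0)[1] + 1)
  for $expr in map:keys($tally)
  order by map:get($tally, $expr) descending
  return fn:concat($expr, ": ", map:get($tally, $expr))
};

(: Made-up sample input standing in for the per-document results. :)
local:tally((
  map:entry("fn:exists($node/self::*)", "PT0.000002S"),
  map:entry("fn:exists($node/self::*)", "PT0.000003S"),
  map:entry("local:other()", "PT0.000001S")
))
```

Each returned string pairs an expression with the number of profiling instances in which it had the longest shallow time, most frequent first.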
--
Eliot Kimber
http://contrext.com
From: <[email protected]> on behalf of "Ladner, Eric
(Eric.Ladner)" <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Thursday, August 24, 2017 at 4:30 PM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Noob query question..
Thank you. I will play with this in my development environment tomorrow. I
don’t quite see how it’s getting the counts per subject, though.
For reference.. the structure is similar to this:
<note>
<subject>Test Subject</subject>
<date_taken>2017-04-01T15:32:00</date_taken>
<content>Blah, blah</content>
</note>
There would be many notes, obviously, and the output would ideally be something
like this (I’m not married to this format; any output showing the counts for
each subject over that time range would do):
<counts>
<countItem>
<subject>Test Subject</subject>
<count>2</count>
</countItem>
<countItem>
<subject>Subject 2</subject>
<count>4</count>
</countItem>
...
</counts>
Eric Ladner
Systems Analyst
[email protected]
From: [email protected]
[mailto:[email protected]] On Behalf Of Sam Mefford
Sent: August 24, 2017 15:59
To: MarkLogic Developer Discussion <[email protected]>
Subject: [**EXTERNAL**] Re: [MarkLogic Dev General] Noob query question..
I should point out that this is not the fastest way to do it. A faster way
would be to index "date-taken" as a dateTime element range index and use
cts:search with cts:element-range-query.
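
A rough sketch of the index-based approach, using the note structure Eric posted. It assumes a dateTime element range index on date_taken and a string element range index on subject (both are assumptions about the database configuration). For the per-subject counts it swaps cts:search for cts:element-values constrained by the same cts:element-range-query, since the value lexicon can return frequencies directly.

```xquery
xquery version "1.0-ml";

(: Assumes range indexes exist: dateTime on date_taken, string on subject. :)
let $cutoff := fn:current-dateTime() - xs:yearMonthDuration("P2Y")
let $recent := cts:element-range-query(xs:QName("date_taken"), ">", $cutoff)
return
  <counts>{
    (: one value per distinct subject among the matching notes :)
    for $subject in cts:element-values(
                      xs:QName("subject"), (),
                      ("fragment-frequency", "frequency-order"),
                      $recent)
    return
      <countItem>
        <subject>{$subject}</subject>
        <count>{cts:frequency($subject)}</count>
      </countItem>
  }</counts>
```

With "fragment-frequency" the count is the number of matching fragments (here, notes) containing each subject, which plays the role of the SQL count(*) ... group by subject when each note is its own document.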
Sam Mefford
Senior Engineer
MarkLogic Corporation
[email protected]
Cell: +1 801 706 9731
www.marklogic.com
This e-mail and any accompanying attachments are confidential. The information
is intended solely for the use of the individual to whom it is addressed. Any
review, disclosure, copying, distribution, or use of this e-mail communication
by others is strictly prohibited. If you are not the intended recipient, please
notify us immediately by returning this message to the sender and delete all
copies. Thank you for your cooperation.
From: [email protected]
[[email protected]] on behalf of Sam Mefford
[[email protected]]
Sent: Thursday, August 24, 2017 2:56 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Noob query question..
XQuery is an extension of XPath. Here's an example in XPath. These things are
easiest to understand if we know the structure of your docs. Suppose I insert:
xdmp:document-insert("test.xml",
<note><date-taken>2015-01-01</date-taken></note>)
I could find the count of docs more than two years old like this:
count(/note[fn:days-from-duration(fn:current-date() - xs:date(date-taken)) > (365 * 2)])
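
The same idea can be extended to a per-subject count without any indexes. This is only a sketch: it assumes the note structure posted elsewhere in this thread (subject and date_taken children, with date_taken as a dateTime and one subject per note), and it will be slower than an index-based query on a large database.

```xquery
xquery version "1.0-ml";

(: Notes taken within the last two years; element names are assumed
   from the sample note structure in this thread. :)
let $cutoff := fn:current-dateTime() - xs:yearMonthDuration("P2Y")
let $recent := /note[xs:dateTime(date_taken) gt $cutoff]
return
  <counts>{
    (: emulate SQL "group by subject" with distinct-values :)
    for $subject in fn:distinct-values($recent/subject)
    return
      <countItem>
        <subject>{$subject}</subject>
        <count>{fn:count($recent[subject eq $subject])}</count>
      </countItem>
  }</counts>
```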
Sam Mefford
Senior Engineer
MarkLogic Corporation
[email protected]
Cell: +1 801 706 9731
www.marklogic.com
From: [email protected]
[[email protected]] on behalf of Ladner, Eric
(Eric.Ladner) [[email protected]]
Sent: Thursday, August 24, 2017 2:24 PM
To: MarkLogic Developer Discussion
Subject: [MarkLogic Dev General] Noob query question..
I’m still rather new to MarkLogic and apparently have a lot to learn. When
doing research on a proof of concept, I ran across a situation that would be
trivial to solve in SQL, but I’m having problems wrapping my head around how to
do that in XQuery. Or, is XQuery even the right place for this?
Basically, the number of notes per subject for any note that’s less than two
years old. If I was to do this in SQL, it’d look something like:
select subject, count(*) from notes where date_taken > sysdate-(365*2)
group by subject;
There’s some additional WHERE clause stuff for filtering, but on average, the
number of results shouldn’t be large.
Any guidance on building up more complex queries like this? The documentation
is semi-helpful, but the examples it gives for usage are usually very
simplistic.
Eric Ladner
Systems Analyst
[email protected]