We monitor an ingestion process by polling MarkLogic and displaying the status
in a browser UI. Near the end of ingestion, and until the in-memory stands are
written to disk, one of these polling queries typically takes 5+ seconds to
run. However, once the stands are written to disk, the same query runs in
under 50 ms.

The ingestion process may insert and delete 10,000-50,000 documents. Of those,
about a dozen will contain metadata related to the import job, and they are
marked up like:

    <metadata-items>
      <type1>valueA</type1>
      <type2>valueB</type2>
      ...
    </metadata-items>

There are four types in total. Each type is associated with an element range
index, and each metadata document typically contains 10,000-80,000 of these
type elements. This is the query:

    cts:count-aggregate(
      cts:element-reference(xs:QName('type1'),...,xs:QName('type4')),
      "item-frequency",
      cts:directory-query($content-dir, 'infinity'))

To try to isolate the under-the-hood process responsible for counting index
entries, we compared the performance of this query to
cts:frequency(cts:values([same parameters as above])), but there was a
disparity there too. During ingestion, the cts:frequency()-based query is more
than 5x as fast as the cts:count-aggregate() query; after the stands have been
written to disk, cts:count-aggregate() is 20x faster.
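
For anyone who wants to reproduce the comparison, here is a minimal sketch of
how the two approaches could be timed side by side. It is not the exact code
we ran: it covers a single type only, it assumes $content-dir is passed in as
an external variable, and the use of xdmp:eager() and xdmp:elapsed-time() for
timing is my assumption about a reasonable way to measure this.

```xquery
xquery version "1.0-ml";

(: Assumption: the import directory URI is supplied as an external variable. :)
declare variable $content-dir as xs:string external;

(: One type only; 'type1' stands in for any of the four indexed elements. :)
let $ref   := cts:element-reference(xs:QName("type1"))
let $query := cts:directory-query($content-dir, "infinity")

let $t0 := xdmp:elapsed-time()
(: xdmp:eager() is intended to force evaluation before the next timestamp. :)
let $by-aggregate := xdmp:eager(
  cts:count-aggregate($ref, "item-frequency", $query))
let $t1 := xdmp:elapsed-time()
(: Same count, obtained by summing the frequency of each index value. :)
let $by-frequency := xdmp:eager(
  fn:sum(
    for $v in cts:values($ref, (), "item-frequency", $query)
    return cts:frequency($v)))
let $t2 := xdmp:elapsed-time()
return (
  fn:concat("count-aggregate: ", $by-aggregate, " in ", $t1 - $t0),
  fn:concat("sum of cts:frequency: ", $by-frequency, " in ", $t2 - $t1)
)
```

Running something like this once during ingestion and once after the stands
have been saved should show whether the same reversal appears for a single
range index.
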
MarkLogic documentation explains that in-memory stands are optimized for
ingestion speed and on-disk stands are optimized for queries, but I am
surprised to see a 100x difference in performance. Is this expected? A
pathological use case? Possibly a performance bug? In any case, I'm confused
about why these two methods of counting the same index frequency values would
perform so differently depending on the state of the stands.
Any thoughts?
-Will