[MarkLogic Dev General] How to structure big XML files for fastest access?

Gary Vidal Sun, 08 Jan 2017 17:14:46 -0800

John,

Your biggest problem is going to be projecting your data from your data
source efficiently.  The best way is either to return exactly what is
requested with no transformation, or to structure the data so that each
document is partitioned along a logical query pattern (for example by
customer or by year) to reduce disk I/O on fetch.  The first thing to
address is your structure.  Ideally, canonical XML will be better than the
property/@attribute/value type of structure.  Avoid attributes in most
cases because they rely on specific accessors, which are slower than
element accessors.  In addition, I find that attributes can cause bleed
conditions or false positives without positional indexes, because the
query may consider the attribute true for one element and true for the
textual value of another element.  Also, you cannot do text searches on
attributes without knowing the element the attribute is anchored to, so
this could prevent full-text searches like cts:word-query.
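
A minimal sketch of that last point, assuming the two document shapes
discussed in this message (the element names are only illustrative). The
attribute form has to name both the element and the attribute, while the
element form can be reached by element name or by a plain word query:

```xquery
xquery version "1.0-ml";

(: Attribute shape: <attribute type="someAttribute" value="SomeValue"/>.
   The query must know the anchoring element as well as the attribute. :)
let $attribute-query :=
  cts:element-attribute-value-query(
    xs:QName("attribute"), xs:QName("value"), "SomeValue")

(: Element shape: <myAttributeAsAnElement>SomeValue</myAttributeAsAnElement>.
   One element name is enough. :)
let $element-query :=
  cts:element-value-query(xs:QName("myAttributeAsAnElement"), "SomeValue")

(: A plain full-text query; it looks at text content, not attribute values. :)
let $word-query := cts:word-query("SomeValue")

return (
  xdmp:estimate(cts:search(fn:doc(), $attribute-query)),
  xdmp:estimate(cts:search(fn:doc(), $element-query)),
  xdmp:estimate(cts:search(fn:doc(), $word-query))
)
```

Comparing the three estimates on your own data should show quickly which
documents the full-text query can and cannot see.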


Here are some short recommendations:

BAD XML in MarkLogic

<attribute type="someAttribute" value="SomeValue"/>

<attribute type="someAttribute">Some Value</attribute>

Be careful of this pattern: a concrete outer element with a generic inner
element.  While you can query it correctly, it requires complex wrapping
rules and may require positional indexes.

cts:element-query(xs:QName("company"),cts:element-value-query(xs:QName("name"),"someValue"))

<company>
   <name>Name</name>
   <state>CA</state>
</company>
<author>
   <name>Author Name</name>
   <state>MD</state>
</author>

Prefer
<root>
   <companyName>Name</companyName>
   <companyState>CA</companyState>
   <authorName>Author Name</authorName>
   <authorState>MD</authorState>
</root>
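
To show what the flattened shape buys you, here is a small sketch that puts
the wrapped query from above next to its flattened equivalent. It assumes
documents in both shapes are loaded, and uses the sample value "Name":

```xquery
xquery version "1.0-ml";

(: Nested shape: the generic <name> must be wrapped in an element-query
   so that it only matches inside <company>. :)
let $nested :=
  cts:element-query(xs:QName("company"),
    cts:element-value-query(xs:QName("name"), "Name"))

(: Flattened shape: the element name already carries the context,
   so a single value query is enough. :)
let $flat := cts:element-value-query(xs:QName("companyName"), "Name")

return (
  xdmp:estimate(cts:search(fn:doc(), $nested)),
  xdmp:estimate(cts:search(fn:doc(), $flat))
)
```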

If you have N-occurring patterns, then you may need positional indexes to
ensure that the state one position after a name is the state that belongs
to that name.

The BEST representation is to flatten as much as possible.  Make every
element representative without any flags or complex attribute queries.
Avoid too many nested conditions such as /a/b/c/a vs. /a/b/a.  MarkLogic
works on pairs of elements when it builds a query plan.  So by reducing the
path steps, you save yourself from complex queries and from false positives
when the path conditions are too deep for MarkLogic to ensure proper
filtering.

<root>
   <myAttributeAsAnElement>SomeValue</myAttributeAsAnElement>
   <companyName>Name</companyName>
   <companyState>CA</companyState>
   <authorName>Author Name</authorName>
   <authorState>MD</authorState>
</root>


Projecting out N columns will also be a challenge if the properties are
dynamic.  At all costs avoid the following patterns.

For outer / for inner - a for loop nested inside another for loop.  This
will increase the cost of your query to x * x, since XQuery will not
optimize the inner loop.

for $r in $rows[1 to ...]
return
  <root>{
     for $col in $r/col[@attribute = "someAttribute"]
     return $col
  }</root>


Avoid //*[@attribute = ...] as this is in effect a table scan for every
property for every row.

Now I am going to assume XQuery for your use case and a possible solution.

The best XQuery rules I can give you are:
- Always use the most absolute path to access an element/attribute.
- If iterating over many columns, make each column call inline, as in the
  two statements below.
- Definitely avoid calling xdmp:eval or xdmp:value inside an iteration.
- Avoid an inner for (FLWOR) to select properties over rows.

for $row in $rows return element outer-element {$row/(col1|col2|col3|col10)}
for $row in $rows return element outer-element
{$row/col1, $row/col2, $row/col3, $row/col10}

Obviously both statements are optimized by the XQuery engine, but they leave
you writing very concrete query patterns for each request.  The best way I
have found to optimize the performance of dynamic query selection is to
create an evaluated function by concatenating your column selection into a
string, then calling xdmp:value to return a materialized function.  Once the
function is created it will optimize itself after repeated calls.

A trivial example would be :

let $results := cts:search(fn:doc(), $some-query-that-returns-results)
let $colpaths := ("/foo/bar", "/foo/bat", "/foo/baz")
(: Create the dynamic function ... Consider what transform you want out and
   add it to the function creation, like JSON or projecting as canonical XML :)
let $func := xdmp:value(fn:concat(
  "function ($result) { element outer-element { $result/(",
  fn:string-join($colpaths, "|"), ") } }"))
return
  for $result in $results return $func($result)

If you find yourself re-using the functions across multiple calls, then you
may consider using xdmp:get-server-field and xdmp:set-server-field, which
store the function in an app-server field so there is no need to rebuild it
across requests.
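
A rough sketch of that caching idea, assuming a function item can be kept in
a server field as described above. The field name "projection-func" and the
word query are made up for illustration; the column paths are the ones from
the trivial example:

```xquery
xquery version "1.0-ml";

(: "projection-func" is a hypothetical field name chosen for this sketch. :)
let $field := "projection-func"
let $cached := xdmp:get-server-field($field)
let $func :=
  if (fn:exists($cached)) then $cached
  else
    (: Build the function once; xdmp:set-server-field returns what it stored. :)
    xdmp:set-server-field($field,
      xdmp:value(
        "function ($result) { element outer-element { $result/(/foo/bar|/foo/bat|/foo/baz) } }"))
(: Placeholder query; substitute the real search for your data. :)
for $result in cts:search(fn:doc(), cts:word-query("someValue"))
return $func($result)
```

Server fields live in memory on each host and are gone after a restart, so
the build step has to stay in the code as a fallback.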

The best way to determine how your query behaves is to use the Profile tab
in Query Console and focus on the top 5-10 calls.  Also look for call
frequencies that seem exponentially larger than the number of documents you
are returning.
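
If you would rather get the same numbers from code, something like the
following may work. It is only a sketch: it assumes profiling is allowed on
the app server, and the report element names used here (expression, count,
expr-source) are my recollection of the profile report layout, so check them
against an actual report before relying on them:

```xquery
xquery version "1.0-ml";

(: prof:eval returns the profile report first, followed by the results.
   The query string is a placeholder for the query under test. :)
let $output :=
  prof:eval('cts:search(fn:doc(), cts:word-query("someValue"))[1 to 10]')
let $report := $output[1]
return
  (: List the ten most frequently evaluated expressions. :)
  (for $e in $report//*:expression
   order by xs:integer($e/*:count) descending
   return fn:concat($e/*:count, " x ", $e/*:expr-source))[1 to 10]
```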

Hope this helps

Regards,

Gary Vidal

