Re: [MarkLogic Dev General] Attribute indexing

David Lee Tue, 19 Aug 2014 08:58:31 -0700

///////////////

MarkLogic always indexes element values and element-attribute values in a hash 
index. No extra configuration is needed, and it can't be turned off.




Element, attribute, and path range indexes are value indexes. These are only 
needed for fast sorting, inequality lookups, facets, and similar operations.

[DAL:] ///



This is true ... but the poster is experiencing unusual slowness.  For data of 
this size, results should be extremely quick, but the devil is in the 
details.  So why is it slow?







The answer first, then the fishing pole.





1) You say you split your documents into smaller files: "I split 1.1GB xml file 
into small pieces of xmls (1000 elements each)."



Could you be more precise?  Ideally you should split the XML file into 
separate documents that are logical units ... I am guessing that "1000" was 
picked just to get the files smaller. If this is true, use 1, not 1000 ... 
each document should be like a table row: it should contain one 
(hierarchical) collection of self-contained information.
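
A minimal sketch of that re-split, run from Query Console. It assumes each of 
your existing chunks has a wrapper root element with many <transaction> 
children; the URIs "/staging/chunk-0001.xml" and "/transactions/..." are 
made up for illustration, so adjust them to your own data.

```xquery
xquery version "1.0-ml";

(: Assumption: a hypothetical URI for one of the existing 1000-element chunks,
   whose root element wraps many <transaction> elements. :)
let $chunk-uri := "/staging/chunk-0001.xml"
for $txn at $i in fn:doc($chunk-uri)/*/transaction
(: Hypothetical URI scheme: one document per transaction. :)
let $uri := fn:concat("/transactions/chunk-0001-", $i, ".xml")
return xdmp:document-insert($uri, $txn)
```

After this, every <transaction> is the root of its own document, which is the 
shape the rest of this advice assumes.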

2) You're using a "*" for the element, but it must be under another specific 
element.





Indexes and indexed searches default to element/attribute *pairs* ... (not 
precisely, but useful to think this way).

To see what can be indexed efficiently, it's sometimes useful to look at the 
primitive search APIs.

A good guide is here: 
http://docs.marklogic.com/guide/search-dev/cts_query#chapter



But a quick look for anything that starts with cts: and has the word "query" 
in it is useful.

Go here: http://docs.marklogic.com/guide/search-dev

Click the "XQuery/XSLT" tab and type "cts:" (wait a few seconds for your 
browser to update).



These are the query-related primitive APIs, and they give a good clue as to 
what's efficient out of the box and what needs help.

Note there is no cts:attribute-query ... only cts:element-attribute items.

This is a close match:



http://docs.marklogic.com/cts:element-attribute-value-query





This is why a path index was suggested (it can explicitly add a new index for 
your attributes).



But why is this needed?  Because your XPath has a * for the element name.



/transaction/*[@transInfoRef='ti1']



This won't optimize with the default indexes ... because the system has no 
idea what element/attribute pair you're looking for ... Add to that my 
suspicion that you didn't break down your XML files into individual 
transactions.

So what the server has to do is:

1) Find all element/attribute index matches with @transInfoRef='ti1' in all 
elements.

2) Since it is not sure whether that element is a direct child of 
/transaction, it needs to load every document.

3) Load each and every document, re-parse it, and then search to see if the 
"*" associated with the @transInfoRef matches an element that is a direct 
child of /transaction/

4) Return all documents to you ... it is not able to stop until the entire DB 
is searched.





Not so good ...



If you add a path range index this will optimize, but there are other ways.

For example, if you know all possible (or useful) element names which are 
associated with your attribute, you can enumerate them in the search.  This 
will allow the search to be resolved 100% from indexes (provided you split 
your documents into 1 transaction per document).
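
As a rough sketch of enumerating the names by hand: the element names below 
("payment", "refund", "adjustment") are invented for illustration, since I 
don't know your schema. Replace them with whichever children of <transaction> 
actually carry @transInfoRef.

```xquery
xquery version "1.0-ml";

(: Hypothetical element names; replace with the ones in your data. :)
let $elems := (xs:QName("payment"), xs:QName("refund"), xs:QName("adjustment"))
return
  cts:search(/transaction,
    cts:element-attribute-value-query(
      $elems,
      xs:QName("transInfoRef"),
      "ti1"))
```

Passing several QNames makes the query match the attribute value on any of 
the listed elements.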



So first do that: re-split your docs down to 1 document per "main XML element" 
... in your snippet I would guess this is <transaction>

- Ideally don't use more than 1 transaction XML element per document, or the 
server will still have to dig into documents where it finds possibly 1 match to 
locate them all.

It can work with bigger groups, but it's better not to.







An easy way to try (prove/disprove) this is to use QConsole

http://localhost:8000/qconsole/





Now since I don't know your data - I copied the one element in and just had the 
system find the names for me.



You don't want to do this for every query - but it's a way to prove the queries 
can be fast ...

If you don't know all the element names at coding time, then either use a path 
index, or use this trick and store the results ... but that gets more 
advanced.



Still, it's worth a try to see what difference this makes.





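(: Discover the names of all direct children of <transaction>. This step
   reads the documents, so it is only meant for a one-off experiment. :)
let $elems := distinct-values(/transaction/*/node-name(.))
return
  (: Match @transInfoRef="ti1" on any of the discovered elements. :)
  cts:search(/transaction,
    cts:element-attribute-value-query(
      $elems,
      xs:QName("transInfoRef"),
      "ti1"))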



Try using the Query Console "Profile" to get an idea of what has to load 
documents and what can go to indexes.

For deeper research the Query Plan is useful ...

https://docs.marklogic.com/guide/performance
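
For instance, a small sketch using xdmp:plan to compare the two forms side by 
side in Query Console. The element name "payment" is a made-up placeholder; 
the rest comes from the query in this thread.

```xquery
xquery version "1.0-ml";

(: Plan for the original wildcard XPath. :)
xdmp:plan(/transaction/*[@transInfoRef = "ti1"]),

(: Plan for the element/attribute pair search.
   "payment" is a hypothetical element name. :)
xdmp:plan(
  cts:search(/transaction,
    cts:element-attribute-value-query(
      xs:QName("payment"),
      xs:QName("transInfoRef"),
      "ti1")))
```

Each call returns an XML description of how the server intends to resolve the 
expression from the indexes, which you can compare before running either 
query for real.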





You may find you can use a slightly different query that doesn't require extra 
tuning ...

or you may find that you need to add a range or path index ...
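
If you do go the path index route, the query side might look roughly like the 
sketch below. It assumes you have already configured a string path range 
index on /transaction/*/@transInfoRef in the database settings, with a 
collation that matches the query's default collation; without that index the 
query raises an error rather than running slowly.

```xquery
xquery version "1.0-ml";

(: Assumption: a path range index of scalar type string exists on
   /transaction/*/@transInfoRef, with a matching collation. :)
cts:search(/transaction,
  cts:path-range-query("/transaction/*/@transInfoRef", "=", "ti1"))
```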


Finally ... how much data is in your results?  A fully optimized query tends 
to be linear in the output data size ... if you have a large number of 
matching rows, then the results take a long time to get to you.
This is another reason to use the search:search or cts:search functions, which 
make it easy to limit the result set and "paginate" it ...
Or you can add [n to m] at the end of your XPath, like

(/transaction/*[@transInfoRef eq "ti1"])[1 to 10]

If you are not sure, always limit your results until you discover a good size.
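
A sketch of the same idea with cts:search, pulling one page at a time and 
using xdmp:estimate for a cheap index-based count. The element name "payment" 
and the page size of 10 are placeholders, not something taken from your data.

```xquery
xquery version "1.0-ml";

(: "payment" is a hypothetical element name. :)
let $query := cts:element-attribute-value-query(
                xs:QName("payment"),
                xs:QName("transInfoRef"),
                "ti1")
let $page-size := 10
let $page := 1
let $start := ($page - 1) * $page-size + 1
return (
  (: Unfiltered count resolved from the indexes, without loading documents. :)
  xdmp:estimate(cts:search(/transaction, $query)),
  (: Only the requested page of results is returned. :)
  cts:search(/transaction, $query)[$start to $start + $page-size - 1]
)
```

Change $page to step through the results. The estimate is unfiltered, so it 
can over-count when the query cannot be fully resolved from indexes.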

-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
d...@marklogic.com
Phone: +1 812-482-5224
Cell:  +1 812-630-7622
www.marklogic.com<http://www.marklogic.com/>

