You are probably running out of "inodes", or running into a limit on directory size,
which will give the same error message if you have large numbers of files.
You can check using "df -i" (assuming you are on a unix-like system, which your
filename suggests).

like this ...
[dlee@z600 ~]$ df -i
Filesystem                  Inodes  IUsed     IFree IUse% Mounted on
/dev/mapper/fedora-root    3276800 231268   3045532    8% /
devtmpfs                   3081228    525   3080703    1% /dev
tmpfs                      3084004      1   3084003    1% /dev/shm
...

The "IUse%" will show if your getting close to the limit (note that unless your 
root you cant get 100%)
Some links:
http://serverfault.com/questions/482173/is-there-any-other-reason-for-no-space-left-on-device
http://stackoverflow.com/questions/466521/how-many-files-can-i-put-in-a-directory
In any case, once you get into the multiple thousands of files in a single directory
you will run into performance problems.

There are several ways to solve this; the easiest is to run xsplit iteratively.
First split into big chunks (depending on how many elements you have, you want
at most about 1000 files at once). So if you had a million elements, this gives
you 1000 files of 1000 elements each:
   xsplit -c 1000

Then process each of these files one by one, loading them into MarkLogic and
deleting (or zipping, or otherwise combining) the temp files, so that at no time
do you have millions of files ...


With xmlsh you would do something like this:

    import ml=marklogic   # for ml:put below
    mkdir big
    xsplit -c 1000 -o big file.xml
    for f in big/*.xml ; do
        rm -rf temp
        mkdir temp
        xsplit -o temp "$f"
        cd temp
            ### Load the 1000 files here, then delete them.
            ### This uses the MarkLogic extension for xmlsh;
            ### you could use mlcp or another tool here, or zip the files into a zip.
            ml:put -baseuri /dir/ -maxfiles 100 -maxthreads 5 *.xml
        cd ..
        rm -rf temp
    done


There are more efficient ways to do this, but they are trickier; try something like
the above first and see if it helps.
With lots of small files you need to batch them during the insert or it will go
slowly, hence the -maxfiles and -maxthreads arguments to ml:put above;
most MarkLogic upload tools have options for this. mlcp is a good one,
and it can read directly from zip files, so instead of doing the upload in the
loop you can just zip the split files into a zip archive and delete the temp files.
Then you will have a bunch of zip files to upload instead of a million XML
files.




From: general-boun...@developer.marklogic.com On Behalf Of irisDeveloper
Sent: Wednesday, August 20, 2014 4:59 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Attribute indexing

Hi David

Thanks for the detailed explanation. I was working on your guidance of splitting
the large XML file into smaller XMLs, each carrying one transaction, using the
xmlsh xsplit utility you recommended. I got the following error:
/xxx/yyyy/x6548087.xml (No space left on device), even though there is 70GB of
space available.


Thanks
Samby


On 08/19/2014 09:27 PM, David Lee wrote:

///////////////

MarkLogic always indexes element values and element-attribute values in a hash 
index. No extra configuration is needed, and it can't be turned off.



Element, attribute, and path range indexes are value indexes. These are only 
needed for fast sorting, inequality lookups, facets, and similar operations.
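
For instance, an inequality lookup is exactly the kind of operation that needs one of those range indexes. A minimal sketch (this assumes a decimal element range index configured on a hypothetical <amount> element; without the index the query raises an error):

    (: needs an element range index of type xs:decimal on "amount" :)
    cts:search(/transaction,
      cts:element-range-query(xs:QName("amount"), ">=", 100.0))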

[DAL:] ///



This is true ... but the poster is experiencing unusual slowness.  For data of
this size, results should be extremely quick --  but -- the devil is in the
details.    So why is it slow?







The answer first, then the fishing pole.





1) You say you split your documents into smaller files " I split 1.1GB xml file 
into small pieces of xmls (1000 elements each)."



Could you be more precise?  Ideally you should split the XML file into separate
documents that

are logical units ... I am guessing that "1000" was picked just to get the
files smaller;

if this is true, use 1, not 1000 ... each document should be like a table row
... containing one (hierarchical) collection of self-contained information.

2) You're using a "*" for the element, but it must be under another specific
element.





Indexes and indexed searches default to element/attribute *pairs* ... (not
precisely true, but it's useful to think of it this way).

To see what can be indexed efficiently, it is sometimes useful to look at the
primitive search APIs.

A good guide is here: 
http://docs.marklogic.com/guide/search-dev/cts_query#chapter



But a quick look for anything starting with cts: that has the word "query" in
it is useful.

Go here: http://docs.marklogic.com/guide/search-dev

Click the "XQuery'XSLT" tab and type "cts:" (wait a few secs for  your browser 
to update)



These are the query-related primitive APIs, and they give a good clue as to what
is efficient out of the box and what needs help.

Note there is no cts:attribute-query ... only cts:element-attribute items.

This is a close match:



http://docs.marklogic.com/cts:element-attribute-value-query
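
For example, a minimal use of it looks like this (the "payment" element name is a placeholder - substitute whatever elements actually carry the attribute):

    cts:search(/transaction,
      cts:element-attribute-value-query(
        xs:QName("payment"),       (: hypothetical element name :)
        xs:QName("transInfoRef"),
        "ti1"))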





This is why the suggestion of a path range index (which can explicitly add a new
index for your attributes).



But why is this needed?  Because your XPath has a * for the element name.



/transaction/*[@transInfoRef='ti1']



This won't optimize with the default indexes ...  because the system has no
idea what element/attribute pair you're looking for ...  Add to that my
suspicion that you didn't break down your XML files into individual
transactions.

So what the server has to do is

1) Find all element/attribute index matches with @transInfoRef='ti1' in all 
elements.

2) Since it is not sure whether that element is a direct child of /transaction, it
needs to load every document

3) Load each and every document, re-parse it, and then search to see if the "*"
associated with the @transInfoRef matches an element  as a direct child of
/transaction/

4) Return all the documents to you ... unable to stop until the entire DB has been
searched.





Not so good ...



If you add a path range index this will optimize, but there are other ways.
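
With such an index in place, a minimal sketch of the query (this assumes a string path range index has been configured on /transaction/*/@transInfoRef in the Admin UI; cts:path-range-query raises an error if no matching index exists):

    (: requires a path range index on /transaction/*/@transInfoRef :)
    cts:search(/transaction,
      cts:path-range-query("/transaction/*/@transInfoRef", "=", "ti1"))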

For example, if you know all possible (or useful) element names which are
associated with your attribute, you can enumerate them in the search.  This will
allow the search to be resolved 100% from indexes (provided you split your
documents into one transaction per document).



So first do that - resplit your docs down to one document per "main XML element"
... in your snippet I would guess this is <transaction>.

- Ideally don't use more than one transaction XML element per document, or the
server will still have to dig into documents where it finds possibly one match in
order to locate them all.

It can work with bigger groups, but it's better not to.







An easy way to try (prove/disprove) this is to use QConsole

http://localhost:8000/qconsole/





Now, since I don't know your data, I copied the one element in and just had the
system find the names for me.



You don't want to do this for every query - but it's a way to prove the queries
can be fast ...

If you don't know all the element names at coding time, then either use a path
index, or use this trick and store the results ... but that gets more advanced.



Still, it's worth a try to see what difference this makes.





    let $elems := distinct-values(/transaction/*/node-name(.))
    return
      cts:search(/transaction,
        cts:element-attribute-value-query(
          $elems,
          xs:QName("transInfoRef"),
          "ti1"))



Try using the Query Console "Profile" tab to get an idea of what has to load
documents and what can be answered from the indexes.

For deeper research the query plan is useful ...

https://docs.marklogic.com/guide/performance
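
For a quick look, you can wrap the slow expression in the built-in xdmp:plan in Query Console; the XML it returns shows which parts the optimizer could push down to the indexes (the exact output format varies by server version):

    xdmp:plan(/transaction/*[@transInfoRef eq "ti1"])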





You may find you can use a slightly different query that doesn't require extra 
tuning ...

or you may find that you need to add a range or path index ...


Finally ... how much data is in your results?   A fully optimized query tends
to be linear in the output data size ... if you have a large number of
matching rows then the results take a long time to get to you.
This is another reason to use the search:search or cts:search functions, which
make it easy to limit the result set and "paginate" it ...
Or you can add [n to m] at the end of your XPath, like

(/transaction/*[@transInfoRef eq "ti1"])[1 to 10]

If you are not sure, always limit your results until you discover a good size.
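
For example, a sketch of fetching just the first page of ten matches with cts:search (again using the hypothetical "payment" element name; cts:search is lazy, so the predicate keeps the server from materializing every match):

    cts:search(/transaction,
      cts:element-attribute-value-query(
        xs:QName("payment"),       (: hypothetical element name :)
        xs:QName("transInfoRef"),
        "ti1"))[1 to 10]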

-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
d...@marklogic.com
Phone: +1 812-482-5224
Cell:  +1 812-630-7622
www.marklogic.com







_______________________________________________
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general