Re: [MarkLogic Dev General] How to pull data out of marklogic quickly?

Mark Shanks Mon, 24 Oct 2016 17:48:07 -0700

Just as an update: I was already using multiple threads, and this definitely 
improved performance. With the code on the server, calling export-csv.xqy from 
the browser returned multiple documents. However, going through the Java REST 
API again returned only a single document. I instead connected to the REST 
endpoint directly from Java. This worked, and the speed is much better than 
going through the Java API. I'm happy with the speed now. Thanks to everyone 
for their suggestions. I am still going to try an XDBC endpoint when I get the 
opportunity, as both mlcp and corb use this connection, so it may be even more 
efficient than the REST interface.
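
A minimal sketch of what connecting to the endpoint directly from Java could 
look like, using only the JDK. The class and method names (DirectExport, 
copyLines, download) are hypothetical and not from the thread. Authentication 
is left out and would have to be added, for example through 
java.net.Authenticator. The response is copied line by line, so the full 
result never has to fit in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class DirectExport {

    /** Copies the stream to the writer one line at a time; returns the line count. */
    static long copyLines(InputStream in, Writer out) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        long count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            out.write(line);
            out.write('\n');
            count++;
        }
        out.flush();
        return count;
    }

    /**
     * Issues a plain GET against the given URL (assumed to be the .xqy module
     * on the app server) and streams the body to the writer.
     * Assumption: no authentication is configured here.
     */
    static long download(String endpoint, Writer out) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream()) {
            return copyLines(in, out);
        } finally {
            conn.disconnect();
        }
    }
}
```

Each worker thread would call download with the export-csv.xqy URL discussed 
in the thread and its own file writer.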

________________________________
From: general-boun...@developer.marklogic.com 
<general-boun...@developer.marklogic.com> on behalf of Geert Josten 
<geert.jos...@marklogic.com>
Sent: Wednesday, 12 October 2016 3:55:39 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] How to pull data out of marklogic quickly?

Hi Mark,

For the moment it would be better to avoid the rest api for this. It would 
cancel the streaming effect of the code currently. Just drop this in an xqy, 
and hit that directly. E.g. if you have an /export/export-csv.xqy in your 
modules database, you can make a call to the rest api with 
http://myserver:1234/export/export-csv.xqy…

Cheers,
Geert

From: <general-boun...@developer.marklogic.com> on behalf of Mark Shanks 
<markshanks...@hotmail.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Tuesday, October 11, 2016 at 10:27 PM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: Re: [MarkLogic Dev General] How to pull data out of marklogic quickly?

I previously tried not wrapping the output in the <data> element. Running the 
following code:

for $x in cts:search(fn:doc(), cts:and-query((
  cts:element-value-query(xs:QName('Department'), 'Sales'),
  cts:element-range-query(xs:QName('Date'), '>', xs:date('2015-01-01')),
  cts:element-range-query(xs:QName('Date'), '<', xs:date('2015-01-03')),
  cts:not-query(cts:element-value-query(xs:QName('Date'), 'NULL')) )),
  'unfiltered', 0.0)
return fn:concat($x//Department, '|', $x//Total, '|', $x//Location, '&#10;')

It would return the required documents in the console. However, when I ran the 
same code using the rest api and java using:

theCall.xquery(query);
out.println(theCall.evalAs(String.class));

It would print out only a single document. I then tried the iterator instead:

theCall.xquery(query);
EvalResultIterator result = theCall.eval();

while (result.hasNext()) {
    out.println(result.next().getString());
}

This did retrieve all of the documents, but was benchmarked as slower - 
presumably because you have so much back and forth between java and the server. 
Is there another way to get the results into java that does not involve the 
iterator but returns all of the documents?

________________________________
From: general-boun...@developer.marklogic.com 
<general-boun...@developer.marklogic.com> on behalf of Geert Josten 
<geert.jos...@marklogic.com>
Sent: Tuesday, 11 October 2016 6:15:57 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] How to pull data out of marklogic quickly?

Hi Mark,

The best way to tackle this would be to parallelize output. Have 10 or more 
worker threads consume parts of the total (how many might depend on your 
cluster size, and the total amount of records you need to produce), and make 
each write a CSV on its own.
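
A rough sketch of that worker layout in Java, under stated assumptions: the 
names ParallelCsvExport and ChunkFetcher are hypothetical, and the fetcher is a 
placeholder for whatever call actually pulls one part of the result from 
MarkLogic. Each worker writes its own part file, so no synchronization is 
needed on the output.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelCsvExport {

    /** Hypothetical hook: writes the lines of one chunk to the given writer. */
    interface ChunkFetcher {
        void fetch(int chunkIndex, BufferedWriter out) throws IOException;
    }

    /** Runs one task per chunk on a fixed pool; each task writes part-N.csv. */
    static List<Path> export(int chunks, int threads, Path dir, ChunkFetcher fetcher)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (int i = 0; i < chunks; i++) {
                final int chunk = i;
                pending.add(pool.submit(() -> {
                    Path file = dir.resolve("part-" + chunk + ".csv");
                    try (BufferedWriter out =
                            Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                        fetcher.fetch(chunk, out);
                    }
                    return file;
                }));
            }
            List<Path> files = new ArrayList<>();
            for (Future<Path> f : pending) {
                files.add(f.get()); // rethrows a worker failure as ExecutionException
            }
            return files;
        } finally {
            pool.shutdown();
        }
    }
}
```

The part files can be concatenated afterwards if a single CSV is required.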

The cts:search is a good starting point, but if you want to emit CSV anyhow, 
then don’t wrap the results of cts:search in a <data> element. Instead let each 
doc found from cts:search return one or more line-strings, which you don’t join 
either. MarkLogic will insert line-ends between such strings automatically, and 
this way it will allow for streaming.

Done right, one worker should be able to produce a 1-million-record CSV file 
in a few minutes on an average laptop.

At this point, I would worry less about using $x//Department, but assuming $x 
holds the document node, you could write $x/Record/Department. That would 
indeed be a little quicker.

Not sure if Corb(2) can produce CSV, and if it would leverage parallelism in 
the same way as I meant, but it could be worth taking a look at cluster-based 
tools like Hadoop. Apache Camel might allow parallel processing too.

Cheers,
Geert

From: <general-boun...@developer.marklogic.com> on behalf of Mark Shanks 
<markshanks...@hotmail.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Tuesday, October 11, 2016 at 12:27 AM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: Re: [MarkLogic Dev General] How to pull data out of marklogic quickly?

MLCP isn't an option as it doesn't provide text-delimited output. 
Text-delimited is a useful format as it allows the data to be pulled into 
practically any other application and with little overhead, unlike xml/json. 
Another problem with xml/json output other than compatibility is the file size. 
When a text-delimited file can be over 30GB with the data we are working with, 
the same data in xml or json becomes absolutely gigantic.

What you say about $x//Department makes sense. If the data is in Marklogic as:

<Record>
      <Department>Sales</Department>
</Record>

What is the best way to get the Department value (i.e., fastest)?
________________________________
From: general-boun...@developer.marklogic.com 
<general-boun...@developer.marklogic.com> on behalf of Sekhon, Navdeep 
<navdeep.sek...@broadridge.com>
Sent: Tuesday, 11 October 2016 6:22:44 AM
To: general@developer.marklogic.com
Subject: Re: [MarkLogic Dev General] How to pull data out of marklogic quickly?

Have you looked into using MLCP? https://developer.marklogic.com/products/mlcp

You can provide your cts query as an option to mlcp, get the documents out of 
ml and do your processing.

Also, this $x//Department is an expensive operation. You should instead give 
the exact xpath.

Regards,

ns/.

-----Original Message-----
From: general-boun...@developer.marklogic.com 
[mailto:general-boun...@developer.marklogic.com] On Behalf Of 
general-requ...@developer.marklogic.com
Sent: Monday, October 10, 2016 3:00 PM
To: general@developer.marklogic.com
Subject: General Digest, Vol 148, Issue 13

Today's Topics:

   1. How to pull data out of marklogic quickly? (Mark Shanks)

----------------------------------------------------------------------

Message: 1
Date: Mon, 10 Oct 2016 18:43:52 +0000
From: Mark Shanks <markshanks...@hotmail.com>
Subject: [MarkLogic Dev General] How to pull data out of marklogic quickly?
To: General@developer.marklogic.com <general@developer.marklogic.com>
Message-ID: 
<ps1pr03mb1820d1a6779e41e6fff4b16fe6...@ps1pr03mb1820.apcprd03.prod.outlook.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

We need to pull large amounts of data out of MarkLogic as quickly as possible. 
I found that XQuery searches such as query-by-example were very slow. Using 
the cts functions led to a big speed increase. However, it isn't clear whether 
my current approach is optimal, or whether there are better alternatives. 
Unfortunately, while there is a lot of documentation describing many different 
ways of doing things in MarkLogic, there seems to be very little describing 
the best or most efficient approaches (e.g., what if your goal is not only to 
run a query successfully, but to maximize its performance?). At present, I'm 
using the Java API to pull documents: I call theCall.xquery(query) in Java to 
run custom XQuery through the REST API. The XQuery is as follows:

<data>{
for $x in cts:search(fn:doc(), cts:and-query((
  cts:element-value-query(xs:QName('Department'), 'Sales'),
  cts:element-range-query(xs:QName('Date'), '>', xs:date('2015-01-01')),
  cts:element-range-query(xs:QName('Date'), '<', xs:date('2015-01-03')),
  cts:not-query(cts:element-value-query(xs:QName('Date'), 'NULL')) )),
  'unfiltered', 0.0)
return fn:concat($x//Department, '|', $x//Total, '|', $x//Location, '&#10;')
}</data>

There are indexes on Date and Department. The XQuery wraps all of the documents 
in the <data> tags and sends the results to the Java program, which then strips 
the <data> tags and prints the results to a text file.

I have found that you can run multiple threads in Java that each request a 
different "chunk" of the data by using range predicates such as 
[1 to 1000000], [1000001 to 2000000], etc.
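
A small sketch of how such chunk predicates could be generated in Java. The 
class ChunkRanges and its methods are hypothetical names, and it assumes the 
total number of matches is already known (for example from a count or estimate 
obtained beforehand). Ranges are 1-based and inclusive, matching the 
[1 to 1000000] notation above.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkRanges {

    /** One chunk of the result sequence, e.g. [1 to 1000000]. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        /** Renders the range as an XQuery positional predicate. */
        String toPredicate() {
            return "[" + start + " to " + end + "]";
        }
    }

    /** Splits positions 1..total into consecutive ranges of at most chunkSize. */
    static List<Range> split(long total, long chunkSize) {
        if (total < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("total must be >= 0 and chunkSize > 0");
        }
        List<Range> ranges = new ArrayList<>();
        for (long start = 1; start <= total; start += chunkSize) {
            long end = Math.min(start + chunkSize - 1, total);
            ranges.add(new Range(start, end));
        }
        return ranges;
    }
}
```

Each thread would then append its own predicate to the cts:search expression 
before submitting the query.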

This approach is much faster than our original approach - 12 hours with 8 
threads, rather than 75 hours using query-by-example. However, it is not clear 
if this is the fastest way, or there are further optimizations or better 
approaches. For instance, when pulling the actual elements from the documents, 
I found that having them indexed made no difference to performance. Is there a 
way of pulling from the indexes to improve performance? Is there a way to 
specify the elements you want in the cts:search that will improve performance? 
Is there a more efficient way to restrict the search range? Is there 
documentation describing the most efficient approaches to querying marklogic?

Thanks.

_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general

End of General Digest, Vol 148, Issue 13
****************************************
