Hi Jörg,

4:45h for 5 mln records comes down to about 300 doc inserts per second. That sounds reasonable for a single host. You can get it a little higher, say 1000/sec, provided your IO is fast enough, for instance with SSDs. The -fastload option will give the best performance gain, as it prevents MarkLogic from double-checking for duplicates in other forests. If you need to scale well beyond 300-1000/sec, scale out horizontally and make sure mlcp gets enough threads; it will target all hosts within the cluster automatically.
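For illustration, here is a quick sanity check on that rate, together with a sketch of the original command with the two tuning knobs added. The host, port, and credentials are taken from the original invocation; the thread count of 16 is only an example value, not a recommendation:

```shell
# Sanity check on the reported rate: 5,000,000 docs in 4h45m.
TOTAL_SECONDS=$((4*3600 + 45*60))        # 17100 seconds
echo $((5000000 / TOTAL_SECONDS))        # prints 292 (docs/sec)

# Hypothetical tuned invocation (not run here; requires a live cluster).
# -fastload skips the duplicate-URI check across forests;
# -thread_count raises client-side parallelism.
# mlcp-8.0-5/bin/mlcp.sh import -host marklogic -port 8383 \
#   -username user -password password \
#   -input_file_type delimited_text -document_type xml \
#   -delimited_root_name root -delimiter "," \
#   -output_uri_prefix /testImport/ -output_uri_suffix .xml \
#   -input_file_path largeFile.csv \
#   -fastload -thread_count 16
```

Note that -fastload trades safety for speed: it should only be used when you are certain no document with the same URI already exists in another forest.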
Cheers,
Geert

From: general-boun...@developer.marklogic.com on behalf of Jörg Teubert <joerg.teub...@lambdawerk.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Wednesday, April 20, 2016 at 11:48 AM
To: general@developer.marklogic.com
Subject: [MarkLogic Dev General] Evaluation: Loading large csv files via mlcp

Dear Users and Developers,

I'm new to MarkLogic and currently evaluating it for use in our business. We need to periodically import large CSV files of data provided by the US government. Each file contains a header line with the field names, followed by many records with the content. Currently we are not happy with the execution time of the import via mlcp.

Some information about the file to import:
- ca. 5 million records
- 329 fields per record
- about 5 GB of data

The import creates one XML document per record in MarkLogic. Functionally it looks fine so far. I used an mlcp call like this:

mlcp-8.0-5/bin/mlcp.sh import -host marklogic -port 8383 -username user -password password -input_file_type delimited_text -document_type xml -delimited_root_name root -delimiter "," -output_uri_prefix /testImport/ -output_uri_suffix .xml -input_file_path largeFile.csv

An import with this runs around 4 hours and 45 minutes, which is more than we expected. My questions are:
- Are there any mlcp options to decrease the execution time? I tried some options, like splitting the input file as described in the user guide, but without any performance gain. Rather the opposite.
- What are others' experiences importing such large CSV files into MarkLogic?
- Are there other ways to import such large files?

I think I need a hint here.
If further information is needed, please let me know. Thank you in advance.

Best Regards,
Jörg Teubert
_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general