Hi Jörg,

4:45h for 5 mln records comes down to about 300 doc inserts per second. That sounds reasonable for a single host. You can get it a little higher, say 1000/sec, provided your IO is fast enough, for instance with SSDs. The -fastload option will give the best performance gain, as it prevents MarkLogic from double-checking for duplicates in other forests. If you need to scale well beyond 300-1000/sec, scale out horizontally and make sure mlcp gets enough threads; it will target all hosts within the cluster automatically.
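For illustration, here is a quick sanity check on that rate, together with a sketch of the original command with the two tuning knobs added. The host, port, and credentials are taken from the original invocation; the thread count of 16 is only an example value, not a recommendation:

```shell
# Sanity check on the reported rate: 5,000,000 docs in 4h45m.
TOTAL_SECONDS=$((4*3600 + 45*60))        # 17100 seconds
echo $((5000000 / TOTAL_SECONDS))        # prints 292 (docs/sec)

# Hypothetical tuned invocation (not run here; requires a live cluster).
# -fastload skips the duplicate-URI check across forests;
# -thread_count raises client-side parallelism.
# mlcp-8.0-5/bin/mlcp.sh import -host marklogic -port 8383 \
#   -username user -password password \
#   -input_file_type delimited_text -document_type xml \
#   -delimited_root_name root -delimiter "," \
#   -output_uri_prefix /testImport/ -output_uri_suffix .xml \
#   -input_file_path largeFile.csv \
#   -fastload -thread_count 16
```

Note that -fastload trades safety for speed: it should only be used when you are certain no document with the same URI already exists in another forest.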
Cheers,
Geert

From: general-boun...@developer.marklogic.com on behalf of Jörg Teubert <joerg.teub...@lambdawerk.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Wednesday, April 20, 2016 at 11:48 AM
To: general@developer.marklogic.com
Subject: [MarkLogic Dev General] Evaluation: Loading large csv files via mlcp

Dear Users and Developers,

I'm new to MarkLogic and currently evaluating it for use in our business. We need to periodically import large CSV files of data provided by the US government. Each file contains a header line with the field names, followed by many records with the content. Currently we are not happy with the execution time of the import via mlcp.

Some information about the file to import:
- ca. 5 million records
- 329 fields per record
- about 5 GB of data

The import creates one XML document per record in MarkLogic. Functionally it looks fine so far. I used an mlcp call like this:

mlcp-8.0-5/bin/mlcp.sh import -host marklogic -port 8383 -username user -password password -input_file_type delimited_text -document_type xml -delimited_root_name root -delimiter "," -output_uri_prefix /testImport/ -output_uri_suffix .xml -input_file_path largeFile.csv

An import with this runs around 4 hours and 45 minutes, which is more than we expected. My questions are:
- Are there any mlcp options to decrease the execution time? I tried some options, like splitting the input file as described in the user guide, but without any performance gain. Rather the opposite.
- What are others' experiences importing such large CSV files into MarkLogic?
- Are there other ways to import such large files?

I think I need a hint here.
If further information is needed, please let me know. Thank you in advance.

Best Regards,
Jörg Teubert
_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general