Hi Geert,

Thank you for your quick reply. At the moment I'm not sure whether we are on an SSD or a classical HD. I will check this and possibly retry.
For the -fastload option I'm not sure whether we will hit one of the restrictions described in the guide once we are in live business and run into difficulties there. But since we are only evaluating at the moment, I will give it a try; I have put the mlcp call I plan to try next at the bottom of this mail, below the quoted message. Are there other possibilities besides MLCP to import so many documents from CSV?

Thank you.

Regards,
Jörg

2016-04-20 12:03 GMT+02:00 Geert Josten <geert.jos...@marklogic.com>:
> Hi Jörg,
>
> 4:45h for 5 mln records comes down to about 300 document inserts per second.
> That sounds reasonable for a single host. You can get it a little higher,
> around 1000/sec, provided your IO is fast enough, for instance with SSD.
> The -fastload option will provide the best performance gain, as it prevents
> MarkLogic from double-checking for duplicates in other forests. If you need
> to scale well beyond 300-1000/sec, just scale out horizontally, and make
> sure MLCP gets enough threads to target all hosts within a cluster. MLCP
> will target all hosts within a cluster automatically.
>
> Cheers,
> Geert
>
> From: <general-boun...@developer.marklogic.com> on behalf of Jörg Teubert
> <joerg.teub...@lambdawerk.com>
> Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Date: Wednesday, April 20, 2016 at 11:48 AM
> To: "general@developer.marklogic.com" <general@developer.marklogic.com>
> Subject: [MarkLogic Dev General] Evaluation: Loading large csv files via mlcp
>
> Dear Users and Developers,
>
> I'm new to MarkLogic and currently working on an evaluation of ML for use
> in our business. We need to periodically import large CSV files of data
> provided by the US government. Each file contains a header line with the
> field names, followed by many records of content.
>
> Currently we are not really happy with the execution time of the import
> via mlcp. Some information about the file to import:
> - ca. 5 million records
> - 329 fields per record
> - about 5 GB of data
>
> During the import one XML document per record is created in ML. Functionally
> it looks fine so far. I used an mlcp call like this:
>
> mlcp-8.0-5/bin/mlcp.sh import -host marklogic -port 8383 -username user
> -password password -input_file_type delimited_text -document_type xml
> -delimited_root_name root -delimiter "," -output_uri_prefix /testImport/
> -output_uri_suffix .xml -input_file_path largeFile.csv
>
> This import runs around 4 hours and 45 minutes, which is more than we
> expected. My questions are:
> - Are there any mlcp options that decrease the execution time? I tried a
> few, such as splitting the input file as described in the user guide, but
> without any gain in performance. Rather the opposite.
> - What are the experiences with importing such large CSV files into ML?
> - Are there other possibilities to import such large files?
>
> I think I need a hint in this case. If any other information is needed,
> please let me know.
>
> Thank you in advance.
> Best Regards,
> Jörg Teubert
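P.S. For reference, here is roughly the mlcp call I plan to try next, combining our original options with the hints above. The -fastload, -split_input, -thread_count and -batch_size options are taken from the MLCP user guide; the concrete values (16 threads, batch size 200) are only guesses for our data and hardware, not tested recommendations:

  mlcp-8.0-5/bin/mlcp.sh import \
    -host marklogic -port 8383 -username user -password password \
    -input_file_type delimited_text -document_type xml \
    -delimited_root_name root -delimiter "," \
    -output_uri_prefix /testImport/ -output_uri_suffix .xml \
    -input_file_path largeFile.csv \
    -fastload true \
    -split_input true \
    -thread_count 16 \
    -batch_size 200

As far as I understand the guide, -fastload should be safe as long as the documents being loaded do not already exist in the database, which holds for our evaluation runs, and -split_input true lets MLCP divide the large delimited file itself so the extra threads can actually be used.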
_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general