Hi Geert,

Thank you for your quick reply. At the moment I'm not sure whether we are on an SSD or a classical HD. I will check this and possibly retry.
For the -fastload option I'm not sure whether we will hit one of the restrictions described in the guide once we are in live business and run into difficulties there. But since we are only evaluating at the moment, I will give it a try; I have put the mlcp call I plan to try next at the bottom of this mail, below the quoted message. Are there other possibilities besides MLCP to import so many documents from CSV?

Thank you.

Regards,
Jörg

2016-04-20 12:03 GMT+02:00 Geert Josten <geert.jos...@marklogic.com>:
> Hi Jörg,
>
> 4:45h for 5 mln records comes down to about 300 document inserts per second.
> That sounds reasonable for a single host. You can get it a little higher,
> around 1000/sec, provided your IO is fast enough, for instance with SSD.
> The -fastload option will provide the best performance gain, as it prevents
> MarkLogic from double-checking for duplicates in other forests. If you need
> to scale well beyond 300-1000/sec, just scale out horizontally, and make
> sure MLCP gets enough threads to target all hosts within a cluster. MLCP
> will target all hosts within a cluster automatically.
>
> Cheers,
> Geert
>
> From: <general-boun...@developer.marklogic.com> on behalf of Jörg Teubert
> <joerg.teub...@lambdawerk.com>
> Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Date: Wednesday, April 20, 2016 at 11:48 AM
> To: "general@developer.marklogic.com" <general@developer.marklogic.com>
> Subject: [MarkLogic Dev General] Evaluation: Loading large csv files via mlcp
>
> Dear Users and Developers,
>
> I'm new to MarkLogic and currently working on an evaluation of ML for use
> in our business. We need to periodically import large CSV files of data
> provided by the US government. Each file contains a header line with the
> field names, followed by many records of content.
>
> Currently we are not really happy with the execution time of the import
> via mlcp. Some information about the file to import:
> - ca. 5 million records
> - 329 fields per record
> - about 5 GB of data
>
> During the import one XML document per record is created in ML. Functionally
> it looks fine so far. I used an mlcp call like this:
>
> mlcp-8.0-5/bin/mlcp.sh import -host marklogic -port 8383 -username user
> -password password -input_file_type delimited_text -document_type xml
> -delimited_root_name root -delimiter "," -output_uri_prefix /testImport/
> -output_uri_suffix .xml -input_file_path largeFile.csv
>
> This import runs around 4 hours and 45 minutes, which is more than we
> expected. My questions are:
> - Are there any mlcp options that decrease the execution time? I tried a
> few, such as splitting the input file as described in the user guide, but
> without any gain in performance. Rather the opposite.
> - What are the experiences with importing such large CSV files into ML?
> - Are there other possibilities to import such large files?
>
> I think I need a hint in this case. If any other information is needed,
> please let me know.
>
> Thank you in advance.
> Best Regards,
> Jörg Teubert
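P.S. For reference, here is roughly the mlcp call I plan to try next, combining our original options with the hints above. The -fastload, -split_input, -thread_count and -batch_size options are taken from the MLCP user guide; the concrete values (16 threads, batch size 200) are only guesses for our data and hardware, not tested recommendations:

  mlcp-8.0-5/bin/mlcp.sh import \
    -host marklogic -port 8383 -username user -password password \
    -input_file_type delimited_text -document_type xml \
    -delimited_root_name root -delimiter "," \
    -output_uri_prefix /testImport/ -output_uri_suffix .xml \
    -input_file_path largeFile.csv \
    -fastload true \
    -split_input true \
    -thread_count 16 \
    -batch_size 200

As far as I understand the guide, -fastload should be safe as long as the documents being loaded do not already exist in the database, which holds for our evaluation runs, and -split_input true lets MLCP divide the large delimited file itself so the extra threads can actually be used.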
_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general