Thanks! This helped me prevent the errors from occurring and, as a bonus, significantly sped up my ingestion.

I couldn't use exactly the mlcp command line you suggested, since -input_file_type xml isn't allowed in the version of mlcp I'm using; I had to use -input_file_type documents instead. Also, my input files don't need to be split. However, bumping up the thread count (to 30 in my case) made the transaction/timeout complaints go away. I'm now ingesting 100,000 documents in 12 minutes rather than one hour. Much better!
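So my command line now looks roughly like this (a sketch only; same {n} placeholders as the template in my original mail below, with the input file type and thread count added):

    mlcp.sh import -username {1} -password {2} -host localhost -port {4} \
        -input_file_type documents \
        -thread_count 30 \
        -input_file_path {5} -output_uri_replace \"{6},'{7}'\"

(-split_input and the per-split options are omitted, since my files don't need splitting.)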
Regards,
Stuart

On Fri, Sep 23, 2016 at 3:34 AM, Jain, Abhishek <abhishek.b.j...@capgemini.com> wrote:

> Hi Stuart,
>
> MLCP comes with various options and can be used in various combinations, depending on file size, available memory, the number of nodes and forests, and so on.
>
> If you want to try a quick solution, you can try this mlcp command:
>
>     mlcp import -host yourhost -port 8000 -username userName -password PASSWORD -input_file_type xml -input_file_path TempData -thread_count -thread_count_per_split 3 -batch_size 200 -transaction_size 20 -max_split_size 33554432 -split_input true
>
> Change the username, input file type, etc. accordingly.
>
> It's always good to use splits and threads when working with a huge dataset.
>
> Some performance points to consider when using the above mlcp command:
>
> 1. In the app server settings, check whether the connection timeout is set to 0.
>
> 2. The default split size is 32 MB; you can change -max_split_size 33554432 (the value is in bytes) if your files are bigger.
>
> 3. Keep the split-to-thread ratio at 1:2 or 1:3. For example, if your document size is 10 MB and your split size is 1,000,000 bytes (1 MB), then 10/1 = 10 splits, so you should create 20 or 30 threads for best CPU utilization. (See the sketch after this list.)
>
> 4. The above mlcp command does well with 150 million rows; it should work for you as well.
>
> 5. I assume you have a good amount of RAM, at least 4 GB.
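> To make the arithmetic in point 3 concrete, a minimal shell sketch (example numbers only; mlcp derives the actual split count from your file sizes):
>
>     # 1:3 split-to-thread sizing from point 3 (example numbers)
>     FILE_SIZE_MB=10                              # one 10 MB input file
>     SPLIT_SIZE_MB=1                              # -max_split_size 1048576 (1 MB, in bytes)
>     SPLITS=$(( FILE_SIZE_MB / SPLIT_SIZE_MB ))   # 10/1 = 10 splits
>     THREADS=$(( SPLITS * 3 ))                    # 1:3 ratio gives 30
>     echo "$SPLITS splits -> use -thread_count $THREADS"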
> Thanks and Regards,
>
> Abhishek Jain
> Associate Consultant
> Capgemini India | Hyderabad
>
> From: general-boun...@developer.marklogic.com [mailto:general-bounces@developer.marklogic.com] On Behalf Of Stuart Myles
> Sent: Thursday, September 22, 2016 11:52 PM
> To: MarkLogic Developer Discussion
> Subject: [MarkLogic Dev General] mlcp Transaction Errors - SVC-EXTIME and XDMP-NOTXN
>
> When I'm loading directories of slightly fewer than 100,000 XML files into a large MarkLogic instance, I often get timeout and transaction errors. If I re-run the same directory of files which got those errors, I typically don't get any errors.
>
> So, I have a few questions:
>
> * Can I prevent the errors from happening in the first place, e.g. by tuning MarkLogic parameters or altering my use of mlcp?
>
> * If I do get errors, what is the best way to get a report on the files which failed, so I can retry just those ones? Is the best option for me to write some code to pick out the errors from the log file? And, if so, am I guaranteed to get all of the files reported?
>
> Some Details
>
> The command line template is:
>
>     mlcp.sh import -username {1} -password {2} -host localhost -port {4} -input_file_path {5} -output_uri_replace \"{6},'{7}'\"
>
> Sometimes the imports run just fine. However, often I get a large number of SVC-EXTIME errors followed by an XDMP-NOTXN error. For example:
>
>     16/09/22 17:54:03 ERROR mapreduce.ContentWriter: SVC-EXTIME: Time limit exceeded
>     16/09/22 17:54:03 WARN mapreduce.ContentWriter: Failed document 029ccd8ac3323658277ca28fead7a73d.0.xml in file:/mnt/ingestion/MarkLogicIngestion/smyles/todo/2014_0005.done/029ccd8ac3323658277ca28fead7a73d.0.xml
>     16/09/22 17:54:03 ERROR mapreduce.ContentWriter: SVC-EXTIME: Time limit exceeded
>     16/09/22 17:54:03 WARN mapreduce.ContentWriter: Failed document 02eb4562784255e249c4ec3ed472f9aa.1.xml in file:/mnt/ingestion/MarkLogicIngestion/smyles/todo/2014_0005.done/02eb4562784255e249c4ec3ed472f9aa.1.xml
>     16/09/22 17:54:04 INFO contentpump.LocalJobRunner: completed 33%
>     16/09/22 17:54:21 ERROR mapreduce.ContentWriter: XDMP-NOTXN: No transaction with identifier 9076269665213828952
>
> So far, I'm just rerunning the entire directory. Most of the time, it ingests fine on the second attempt. However, I have thousands of these directories to process, so I would prefer to avoid getting the errors in the first place. Failing that, I would like to capture the errors and retry just the files which failed.
>
> Any help much appreciated.
>
> Regards,
>
> Stuart
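For the "retry just the failed files" part of the question, one workable approach is to scrape the mlcp console output for the WARN lines shown above. A minimal sketch, assuming the run was captured to a file named mlcp-run.log (a hypothetical name) and that failures appear in the "Failed document ... in file:<path>" format from the log excerpt:

    #!/bin/sh
    # Collect the source paths of failed documents from an mlcp run log
    # and copy them into a retry directory for a second, smaller import.
    mkdir -p retry
    grep 'Failed document' mlcp-run.log \
      | sed 's/^.*in file://' \
      | while read -r path; do
          cp "$path" retry/
        done
    # Then re-run mlcp with: -input_file_path retry

Adjust the pattern to your mlcp version's log format; whether every failed file is guaranteed to be reported this way is exactly the open question in the original mail.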