There are ways to prevent that. You could have a look at CPF as a very robust way of processing files, though it is usually initiated after doc-insert, not before. It also doesn’t necessarily prevent queue overload, but you could always increase the queue size if necessary.
Taskbot is a very good library that takes a very smart approach to spawning tasks without flooding the task server queue (https://github.com/mblakele/taskbot). And Sam mentioned using external applications to push the processing. DMSDK is the latest option, and would work well in combination with something like Apache Camel to monitor an external folder and push changes as they happen, rather than on a schedule.

Cheers,
Geert

From: <general-boun...@developer.marklogic.com> on behalf of "Ladner, Eric (Eric.Ladner)" <eric.lad...@chevron.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Tuesday, August 22, 2017 at 10:33 PM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: Re: [MarkLogic Dev General] Large job processing question.

Is it smart enough not to spawn 100,000 jobs at once and swamp the system?

Eric Ladner
Systems Analyst
eric.lad...@chevron.com

From: general-boun...@developer.marklogic.com On Behalf Of Geert Josten
Sent: August 22, 2017 13:59
To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: [**EXTERNAL**] Re: [MarkLogic Dev General] Large job processing question.

Hi Eric,

Personally, I would probably let go of the all-docs-at-once approach, and spawn processes for each input (sub)folder, and potentially for batches or individual files in any folder as well. The same goes for the existing documents: spawn a process for batches or individual docs that checks whether they still exist. If you make them append logs to the documents or their properties, you can gather reports about changes afterwards if needed.

Cheers,
Geert

From: <general-boun...@developer.marklogic.com> on behalf of "Ladner, Eric (Eric.Ladner)" <eric.lad...@chevron.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Tuesday, August 22, 2017 at 4:36 PM
To: "general@developer.marklogic.com" <general@developer.marklogic.com>
Subject: [MarkLogic Dev General] Large job processing question.

We have some large jobs (ingestion and validation of unstructured documents) that have timeout issues. The jobs are structured so that the first job checks that all the existing documents are valid (still exist on the file system). It does this in two steps: 1) gather all documents to be validated from the DB, 2) check that list against the file system.

The second job is: 1) the filesystem is traversed to find any new documents (or ones that have been modified in the last X days), 2) those new/modified documents are ingested.

The problem in the second step is there could be tens of thousands of documents in a hundred thousand folders (don’t ask). The job will typically time out after an hour during the “go find all the new documents” phase.

I’m trying to find out if there’s a way to restructure the job so that it runs faster and doesn’t time out, or maybe break the task up into parts that run in parallel or something.

Any thoughts welcome.

Eric Ladner
Systems Analyst
eric.lad...@chevron.com
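
For readers following the thread, here is a rough sketch of the batching idea discussed above, written as an external Python driver rather than MarkLogic code. It walks the tree, keeps only files modified in the last X days, groups them into fixed-size batches, and caps how many batches are in flight so the work queue is never flooded. All names here (`recent_files`, `batches`, `run_bounded`, and the `handler` callback) are hypothetical and not part of Taskbot, CPF, or DMSDK; the handler stands in for whatever actually ingests a batch.

```python
import os
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Iterable, Iterator, List


def recent_files(root: str, max_age_days: float) -> Iterator[str]:
    """Yield paths under root whose mtime falls within the last max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            try:
                if os.path.getmtime(path) >= cutoff:
                    yield path
            except OSError:
                continue  # file vanished or is unreadable; skip it


def batches(items: Iterable, size: int) -> Iterator[List]:
    """Group items lazily into lists of at most `size` elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_bounded(work: Iterable, handler: Callable, workers: int = 4,
                max_pending: int = 8) -> int:
    """Run handler(item) for each item, keeping at most max_pending queued.

    Returns the number of items processed. Exceptions raised by the
    handler propagate to the caller.
    """
    pending = set()
    done_count = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for item in work:
            if len(pending) >= max_pending:
                # Block until at least one task finishes before adding more.
                finished, pending = wait(pending, return_when=FIRST_COMPLETED)
                for future in finished:
                    future.result()
                    done_count += 1
            pending.add(pool.submit(handler, item))
        for future in pending:
            future.result()
            done_count += 1
    return done_count
```

A call such as `run_bounded(batches(recent_files(root, 7), 500), ingest_batch)` would process one week of changes in batches of 500, where `ingest_batch` is a placeholder for the real ingestion step. Because the traversal is a generator, no full file list is built up front, which is the part of the original job that was timing out.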