The Task Manager will queue the jobs, and it will only process as many at once as it has threads configured.
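For illustration, queueing the work as many small tasks looks roughly like this in XQuery (a sketch, not production code; the batch size and the per-document work are placeholders, and cts:uris assumes the URI lexicon is enabled):

xquery version "1.0-ml";

(: Queue one small task per batch of URIs. The Task Server holds them in
   its queue and runs only as many at a time as it has threads configured. :)
let $uris := cts:uris()                    (: requires the URI lexicon :)
let $batch-size := 10
for $i in 1 to xs:integer(fn:ceiling(fn:count($uris) div $batch-size))
let $batch := fn:subsequence($uris, ($i - 1) * $batch-size + 1, $batch-size)
return
  xdmp:spawn-function(
    function() {
      for $uri in $batch
      return ()                            (: process one document here :)
    },
    <options xmlns="xdmp:eval"><update>true</update></options>
  )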
In my profiling application I'm queueing tens of thousands of 10-document tasks. My Task Manager has a maximum queue of 100,000 tasks. If I instead do a small number of large tasks, I quickly exhaust RAM.

Cheers,
E.
--
Eliot Kimber
http://contrext.com

From: <[email protected]> on behalf of "Ladner, Eric (Eric.Ladner)" <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, August 22, 2017 at 3:33 PM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Large job processing question.

Is it smart enough not to spawn 100,000 jobs at once and swamp the system?

Eric Ladner
Systems Analyst
[email protected]

From: [email protected] [mailto:[email protected]] On Behalf Of Geert Josten
Sent: August 22, 2017 13:59
To: MarkLogic Developer Discussion <[email protected]>
Subject: [**EXTERNAL**] Re: [MarkLogic Dev General] Large job processing question.

Hi Eric,

Personally, I would probably let go of the all-docs-at-once approach and spawn a process for each input (sub)folder, and potentially for batches or individual files within each folder as well. Do the same for the existing documents: spawn a process for batches or individual docs that checks whether they still exist. If you make those processes append logs to the documents or their properties, you can gather reports about the changes afterwards if needed. (Rough sketches of both patterns follow the thread below.)

Cheers,
Geert

From: <[email protected]> on behalf of "Ladner, Eric (Eric.Ladner)" <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, August 22, 2017 at 4:36 PM
To: "[email protected]" <[email protected]>
Subject: [MarkLogic Dev General] Large job processing question.

We have some large jobs (ingestion and validation of unstructured documents) that have timeout issues. The jobs are structured so that the first job checks that all the existing documents are still valid (i.e., still exist on the file system). It does this in two steps: 1) gather the list of documents to be validated from the DB, and 2) check that list against the file system.

The second job: 1) traverses the file system to find any new documents (or documents that have been modified in the last X days), and 2) ingests those new/modified documents.

The problem in the second job is that there can be tens of thousands of documents spread across a hundred thousand folders (don’t ask), and the job will typically time out after an hour during the “go find all the new documents” phase.

I’m trying to find out if there’s a way to restructure the job so that it runs faster and doesn’t time out, or perhaps break the task into parts that run in parallel. Any thoughts welcome.

Eric Ladner
Systems Analyst
[email protected]
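Geert's per-folder suggestion could look roughly like the following sketch, assuming an XQuery task module (the root path, module URI, and 30-day window are hypothetical):

(: Dispatcher: queue one task per subfolder instead of walking the whole
   tree in a single, timeout-prone request. :)
xquery version "1.0-ml";

declare namespace dir = "http://marklogic.com/xdmp/directory";

declare variable $root := "/data/incoming";          (: hypothetical root :)

for $sub in xdmp:filesystem-directory($root)/dir:entry[dir:type eq "directory"]
return
  xdmp:spawn(
    "/tasks/ingest-folder.xqy",                      (: hypothetical module :)
    (xs:QName("folder"), fn:string($sub/dir:pathname)),
    <options xmlns="xdmp:eval"><update>true</update></options>
  )

(: /tasks/ingest-folder.xqy — ingest recent files from one folder, then
   re-spawn for each subfolder so the traversal fans out across tasks. :)
xquery version "1.0-ml";

declare namespace dir = "http://marklogic.com/xdmp/directory";

declare variable $folder external;

let $entries := xdmp:filesystem-directory($folder)/dir:entry
return (
  for $f in $entries[dir:type eq "file"]
  (: "modified in the last X days"; assumes dir:last-modified casts to xs:dateTime :)
  where xs:dateTime($f/dir:last-modified)
        ge fn:current-dateTime() - xs:dayTimeDuration("P30D")
  return xdmp:document-load(fn:string($f/dir:pathname)),

  for $d in $entries[dir:type eq "directory"]
  return xdmp:spawn("/tasks/ingest-folder.xqy",
    (xs:QName("folder"), fn:string($d/dir:pathname)),
    <options xmlns="xdmp:eval"><update>true</update></options>)
)

Because each folder becomes its own task, no single request has to walk all hundred thousand folders, which is what was hitting the one-hour timeout.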
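The existence check for documents already in the database can be batched the same way, appending the result to each document's properties as Geert suggests (a sketch; the batch size and property name are made up, and it assumes document URIs are the original filesystem paths):

xquery version "1.0-ml";

(: Spawn one existence-check task per batch of database URIs. :)
let $uris := cts:uris()                    (: requires the URI lexicon :)
let $batch-size := 500
for $i in 1 to xs:integer(fn:ceiling(fn:count($uris) div $batch-size))
let $batch := fn:subsequence($uris, ($i - 1) * $batch-size + 1, $batch-size)
return
  xdmp:spawn-function(
    function() {
      for $uri in $batch
      (: assumes the document URI is also its filesystem path :)
      let $exists := xdmp:filesystem-file-exists($uri)
      return
        xdmp:document-add-properties($uri,
          <validated exists="{$exists}">{fn:current-dateTime()}</validated>)
    },
    <options xmlns="xdmp:eval"><update>true</update></options>
  )

Afterwards the properties can be queried to report on anything that has disappeared from the file system.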
