Hi Eric,

Personally, I would probably let go of the all-docs-at-once approach and spawn a process for each input (sub)folder, and potentially for batches or individual files within a folder as well. The same goes for the existing documents: spawn a process per batch or per individual doc that checks whether it still exists on the file system. If you make these processes append logs to the documents or their properties, you can gather reports about changes afterwards if needed.
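To make the per-folder and per-batch idea concrete, here is a rough sketch in Python. It is only an illustration under assumptions, not the actual job code: it runs outside MarkLogic, and the names `scan_folder`, `find_changed`, `batched` and `missing_paths` are made up for this example. Each folder is scanned as its own small task, subfolders are queued as further tasks, and the resulting file list is cut into batches that could each be handed to a separate, short-lived ingestion or validation request.

```python
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from itertools import islice


def scan_folder(folder, cutoff):
    """Scan ONE folder (non-recursive).

    Returns (subfolders, changed_files), where changed_files are the files
    directly in `folder` whose modification time is at or after `cutoff`.
    """
    subfolders, changed = [], []
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subfolders.append(entry.path)
            elif entry.is_file() and entry.stat().st_mtime >= cutoff:
                changed.append(entry.path)
    return subfolders, changed


def find_changed(root, cutoff, workers=8):
    """Find new/modified files under `root`, one small task per folder."""
    changed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(scan_folder, root, cutoff)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                subfolders, files = future.result()
                changed.extend(files)
                # Every subfolder becomes its own task instead of one long walk.
                pending |= {pool.submit(scan_folder, sub, cutoff)
                            for sub in subfolders}
    return changed


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def missing_paths(batch):
    """Validation for one batch: paths that no longer exist on disk."""
    return [path for path in batch if not os.path.exists(path)]
```

For example, with `cutoff = time.time() - 7 * 86400` (after `import time`), `batched(find_changed(root, cutoff), 500)` yields lists of at most 500 changed files from the last seven days. Since no single task has to cover the whole tree, none of them needs to run anywhere near an hour. The batch size and worker count are arbitrary here and would need tuning.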
Cheers,
Geert

From: <[email protected]> on behalf of "Ladner, Eric (Eric.Ladner)" <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, August 22, 2017 at 4:36 PM
To: <[email protected]>
Subject: [MarkLogic Dev General] Large job processing question.

We have some large jobs (ingestion and validation of unstructured documents) that have timeout issues.

The jobs are structured as follows. The first job checks that all the existing documents are still valid (still exist on the file system). It does this in two steps: 1) gather all documents to be validated from the DB, 2) check that list against the file system.

The second job is: 1) the filesystem is traversed to find any new documents (or documents that have been modified in the last X days), 2) those new/modified documents are ingested.

The problem in the second job is that there could be tens of thousands of documents in a hundred thousand folders (don’t ask). The job will typically time out after an hour during the “go find all the new documents” phase.

I’m trying to find out if there’s a way to restructure the job so that it runs faster and doesn’t time out, or maybe break up the task into different parts that run in parallel or something.

Any thoughts welcome.

Eric Ladner
Systems Analyst
[email protected]
_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
