There are ways to prevent that.

You could have a look at CPF (Content Processing Framework) as a very robust 
way of processing files, though it is usually triggered after doc-insert, not 
before. It also doesn’t necessarily prevent queue overload, but you could 
always increase the task server queue size if necessary.

Taskbot is a very good library that takes a very smart approach to spawning 
tasks without flooding the task server queue. 
(https://github.com/mblakele/taskbot)
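
To illustrate the general idea of spawning work without flooding a queue, here 
is a minimal sketch in plain Python. It is not Taskbot's API and not MarkLogic 
code; the function name run_throttled and the max_in_flight parameter are made 
up for this example. The producer blocks whenever the allowed number of tasks 
is already queued or running.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_throttled(items, worker, max_in_flight=4):
    """Run worker(item) for every item, with at most max_in_flight tasks pending.

    Hypothetical helper for illustration only; not part of Taskbot or MarkLogic.
    """
    slots = threading.BoundedSemaphore(max_in_flight)
    futures = []

    def release(_future):
        # A finished task frees one slot for the producer.
        slots.release()

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for item in items:
            slots.acquire()  # blocks here while all slots are taken
            future = pool.submit(worker, item)
            future.add_done_callback(release)
            futures.append(future)
    # Results come back in submission order.
    return [f.result() for f in futures]
```

With 100,000 items this still creates one task per item, but only a handful 
exist at any moment, so the queue never holds more than max_in_flight entries.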

And Sam mentioned using external applications to push the processing. DMSDK 
(Data Movement SDK) is the latest option, and would work well in combination 
with something like Apache Camel to monitor an external folder and push 
changes as they happen, rather than on a schedule.

Cheers,
Geert

From: 
<general-boun...@developer.marklogic.com<mailto:general-boun...@developer.marklogic.com>>
 on behalf of "Ladner, Eric (Eric.Ladner)" 
<eric.lad...@chevron.com<mailto:eric.lad...@chevron.com>>
Reply-To: MarkLogic Developer Discussion 
<general@developer.marklogic.com<mailto:general@developer.marklogic.com>>
Date: Tuesday, August 22, 2017 at 10:33 PM
To: MarkLogic Developer Discussion 
<general@developer.marklogic.com<mailto:general@developer.marklogic.com>>
Subject: Re: [MarkLogic Dev General] Large job processing question.

Is it smart enough not to spawn 100,000 jobs at once and swamp the system?

Eric Ladner
Systems Analyst
eric.lad...@chevron.com<mailto:eric.lad...@chevron.com>


From: 
general-boun...@developer.marklogic.com<mailto:general-boun...@developer.marklogic.com>
 [mailto:general-boun...@developer.marklogic.com] On Behalf Of Geert Josten
Sent: August 22, 2017 13:59
To: MarkLogic Developer Discussion 
<general@developer.marklogic.com<mailto:general@developer.marklogic.com>>
Subject: [**EXTERNAL**] Re: [MarkLogic Dev General] Large job processing 
question.

Hi Eric,

Personally, I would probably let go of the all-docs-at-once approach, and spawn 
a process for each input (sub)folder, and potentially for batches or individual 
files in any folder as well. The same goes for the existing documents: spawn a 
process for batches or individual docs that checks whether they still exist. If 
you make them append logs to the documents or their properties, you can gather 
reports about changes afterwards if needed.
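
As a rough illustration of the per-folder splitting, here is a sketch in plain 
Python rather than MarkLogic code. The function name folder_batches and the 
batch_size parameter are assumptions for this example. Each yielded batch is 
small enough to hand to its own task, so no single task has to see the whole 
tree.

```python
import os


def folder_batches(root, batch_size=100):
    """Yield (folder, [file paths]) pairs, at most batch_size files per pair.

    Hypothetical helper for illustration only. os.walk is lazy, so folders
    are read one at a time instead of building one huge list up front.
    """
    for folder, _subdirs, files in os.walk(root):
        paths = [os.path.join(folder, name) for name in sorted(files)]
        for start in range(0, len(paths), batch_size):
            yield folder, paths[start:start + batch_size]
```

Each (folder, batch) pair could then be passed to a separate spawned task, 
which keeps every task well below the timeout.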

Cheers,
Geert

From: 
<general-boun...@developer.marklogic.com<mailto:general-boun...@developer.marklogic.com>>
 on behalf of "Ladner, Eric (Eric.Ladner)" 
<eric.lad...@chevron.com<mailto:eric.lad...@chevron.com>>
Reply-To: MarkLogic Developer Discussion 
<general@developer.marklogic.com<mailto:general@developer.marklogic.com>>
Date: Tuesday, August 22, 2017 at 4:36 PM
To: "general@developer.marklogic.com<mailto:general@developer.marklogic.com>" 
<general@developer.marklogic.com<mailto:general@developer.marklogic.com>>
Subject: [MarkLogic Dev General] Large job processing question.

We have some large jobs (ingestion and validation of unstructured documents) 
that have timeout issues.
The way the jobs are structured is that the first job checks that all the 
existing documents are valid (still exist on the file system).  It does this 
in two steps:

     1) gather all documents to be validated from the DB
     2) check that list against the file system.
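
The two steps above could be done in batches instead of in one pass. A minimal 
sketch in plain Python, assuming the database side can supply an iterable of 
file system paths; the function name missing_files and the batch_size 
parameter are made up for this example:

```python
import os
from itertools import islice


def missing_files(db_paths, batch_size=1000):
    """Yield lists of paths that are recorded in the DB but gone from disk.

    Hypothetical helper for illustration only. db_paths can be any iterable,
    so the full list never has to be held in memory at once.
    """
    it = iter(db_paths)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        missing = [p for p in batch if not os.path.exists(p)]
        if missing:
            yield missing
```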

The second job is:
     1) the filesystem is traversed to find any new documents (or that have 
been modified in the last X days),
     2) those new/modified documents are ingested.
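
For a single folder, the "modified in the last X days" check might look like 
the sketch below. It is plain Python, not MarkLogic code; the function name 
recently_modified and its parameters are assumptions for this example, and it 
treats a new file as one whose modification time is recent.

```python
import os
import time


def recently_modified(folder, days, now=None):
    """Return files directly inside folder modified within the last `days` days.

    Hypothetical helper for illustration only. It does not recurse, so one
    call stays cheap and can run as one small task per folder.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds per day
    recent = []
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file() and entry.stat().st_mtime >= cutoff:
                recent.append(entry.path)
    return sorted(recent)
```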

The problem in the second job is that there could be tens of thousands of 
documents in a hundred thousand folders (don’t ask).  The job will typically 
time out 
after an hour during the “go find all the new documents” phase.  I’m trying to 
find out if there’s a way to re-structure the job so that it runs faster and 
doesn’t time out, or maybe breaks up the task into different parts that run in 
parallel or something.  Any thoughts welcome.

Eric Ladner
Systems Analyst
eric.lad...@chevron.com<mailto:eric.lad...@chevron.com>

_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general
