We have some large jobs (ingestion and validation of unstructured documents) 
that have timeout issues.
The way the jobs are structured is that the first job checks that 
all the existing documents are valid (still exist on the file system).  It 
does this in two steps:

     1) gather all documents to be validated from the DB
     2) check that list against the file system.
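
To make the first job concrete, here is a minimal Python sketch of the check 
in step 2.  It is not the actual job code: find_missing is a hypothetical 
name, and it assumes the DB query in step 1 has already produced a plain list 
of file system paths.

```python
import os

def find_missing(db_paths):
    """Return the DB-recorded paths that no longer exist on the file system."""
    # db_paths is assumed to be an iterable of path strings from step 1.
    return [path for path in db_paths if not os.path.exists(path)]
```

Anything this returns would be a document that is no longer valid.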

The second job also has two steps:

     1) traverse the file system to find any new documents (or documents 
        that have been modified in the last X days),
     2) ingest those new/modified documents.
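
Step 1 of the second job amounts to something like the Python sketch below.  
This is only an illustration, not the real job: find_recent_files, root and 
days are made-up names, and it assumes "new or modified" can be decided from 
the file's modification time alone (a file copied in with its old timestamp 
preserved would be missed).

```python
import os
import time

def find_recent_files(root, days):
    """Yield paths under root whose mtime falls within the last `days` days."""
    cutoff = time.time() - days * 86400
    stack = [root]  # explicit stack instead of recursion for deep trees
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        if entry.stat(follow_symlinks=False).st_mtime >= cutoff:
                            yield entry.path
        except OSError:
            continue  # skip folders that vanished or cannot be read
```

Because it is a generator, the caller can start ingesting the first hits 
without waiting for the whole walk to finish.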

The problem is in the second job: there could be tens of thousands of 
documents in a hundred thousand folders (don't ask).  The job will typically 
time out after an hour during the "go find all the new documents" phase.  I'm 
trying to find out if there's a way to restructure the job so that it runs 
faster and doesn't time out, or maybe to break the task up into parts that 
run in parallel.  Any thoughts welcome.
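
To show what I mean by breaking it up, here is a rough Python sketch that 
splits the scan by top-level folder and runs the pieces concurrently.  All the 
names (scan_tree, scan_in_parallel, workers) are made up for illustration, and 
whether threads actually help would depend on the storage behind the file 
system; I have not measured it.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

def scan_tree(root, cutoff):
    """Return files under root modified at or after the cutoff timestamp."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) >= cutoff:
                    found.append(path)
            except OSError:
                pass  # file disappeared or is unreadable; skip it
    return found

def scan_in_parallel(root, days, workers=8):
    """Scan each top-level folder of root in its own worker thread."""
    cutoff = time.time() - days * 86400
    subdirs, found = [], []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
            elif entry.is_file() and entry.stat().st_mtime >= cutoff:
                found.append(entry.path)  # files sitting directly in root
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(lambda d: scan_tree(d, cutoff), subdirs):
            found.extend(result)
    return found
```

The same split could instead become one job per top-level folder, so that 
each slice gets its own timeout.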

Eric Ladner
Systems Analyst
[email protected]
