The Task Manager will queue the jobs, and it will only process as many at once as it has threads configured.
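For illustration, queueing the work as many small tasks looks roughly like this in XQuery (a sketch, not production code; the batch size and the per-document work are placeholders, and cts:uris assumes the URI lexicon is enabled):

xquery version "1.0-ml";

(: Queue one small task per batch of URIs. The Task Server holds them in
   its queue and runs only as many at a time as it has threads configured. :)
let $uris := cts:uris()                    (: requires the URI lexicon :)
let $batch-size := 10
for $i in 1 to xs:integer(fn:ceiling(fn:count($uris) div $batch-size))
let $batch := fn:subsequence($uris, ($i - 1) * $batch-size + 1, $batch-size)
return
  xdmp:spawn-function(
    function() {
      for $uri in $batch
      return ()                            (: process one document here :)
    },
    <options xmlns="xdmp:eval"><update>true</update></options>
  )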
In my profiling application I'm queueing tens of thousands of 10-document tasks. My Task Manager has a maximum queue of 100,000 tasks. If I instead do a small number of large tasks, I quickly exhaust RAM.

Cheers,
E.
--
Eliot Kimber
http://contrext.com

From: <[email protected]> on behalf of "Ladner, Eric (Eric.Ladner)" <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, August 22, 2017 at 3:33 PM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Large job processing question.

Is it smart enough not to spawn 100,000 jobs at once and swamp the system?

Eric Ladner
Systems Analyst
[email protected]

From: [email protected] [mailto:[email protected]] On Behalf Of Geert Josten
Sent: August 22, 2017 13:59
To: MarkLogic Developer Discussion <[email protected]>
Subject: [**EXTERNAL**] Re: [MarkLogic Dev General] Large job processing question.

Hi Eric,

Personally, I would probably let go of the all-docs-at-once approach and spawn a process for each input (sub)folder, and potentially for batches or individual files within each folder as well. Do the same for the existing documents: spawn a process for batches or individual docs that checks whether they still exist. If you make those processes append logs to the documents or their properties, you can gather reports about the changes afterwards if needed. (Rough sketches of both patterns follow the thread below.)

Cheers,
Geert

From: <[email protected]> on behalf of "Ladner, Eric (Eric.Ladner)" <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, August 22, 2017 at 4:36 PM
To: "[email protected]" <[email protected]>
Subject: [MarkLogic Dev General] Large job processing question.

We have some large jobs (ingestion and validation of unstructured documents) that have timeout issues. The jobs are structured so that the first job checks that all the existing documents are still valid (i.e., still exist on the file system). It does this in two steps: 1) gather the list of documents to be validated from the DB, and 2) check that list against the file system.

The second job: 1) traverses the file system to find any new documents (or documents that have been modified in the last X days), and 2) ingests those new/modified documents.

The problem in the second job is that there can be tens of thousands of documents spread across a hundred thousand folders (don’t ask), and the job will typically time out after an hour during the “go find all the new documents” phase.

I’m trying to find out if there’s a way to restructure the job so that it runs faster and doesn’t time out, or perhaps break the task into parts that run in parallel. Any thoughts welcome.

Eric Ladner
Systems Analyst
[email protected]
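Geert's per-folder suggestion could look roughly like the following sketch, assuming an XQuery task module (the root path, module URI, and 30-day window are hypothetical):

(: Dispatcher: queue one task per subfolder instead of walking the whole
   tree in a single, timeout-prone request. :)
xquery version "1.0-ml";

declare namespace dir = "http://marklogic.com/xdmp/directory";

declare variable $root := "/data/incoming";          (: hypothetical root :)

for $sub in xdmp:filesystem-directory($root)/dir:entry[dir:type eq "directory"]
return
  xdmp:spawn(
    "/tasks/ingest-folder.xqy",                      (: hypothetical module :)
    (xs:QName("folder"), fn:string($sub/dir:pathname)),
    <options xmlns="xdmp:eval"><update>true</update></options>
  )

(: /tasks/ingest-folder.xqy — ingest recent files from one folder, then
   re-spawn for each subfolder so the traversal fans out across tasks. :)
xquery version "1.0-ml";

declare namespace dir = "http://marklogic.com/xdmp/directory";

declare variable $folder external;

let $entries := xdmp:filesystem-directory($folder)/dir:entry
return (
  for $f in $entries[dir:type eq "file"]
  (: "modified in the last X days"; assumes dir:last-modified casts to xs:dateTime :)
  where xs:dateTime($f/dir:last-modified)
        ge fn:current-dateTime() - xs:dayTimeDuration("P30D")
  return xdmp:document-load(fn:string($f/dir:pathname)),

  for $d in $entries[dir:type eq "directory"]
  return xdmp:spawn("/tasks/ingest-folder.xqy",
    (xs:QName("folder"), fn:string($d/dir:pathname)),
    <options xmlns="xdmp:eval"><update>true</update></options>)
)

Because each folder becomes its own task, no single request has to walk all hundred thousand folders, which is what was hitting the one-hour timeout.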
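The existence check for documents already in the database can be batched the same way, appending the result to each document's properties as Geert suggests (a sketch; the batch size and property name are made up, and it assumes document URIs are the original filesystem paths):

xquery version "1.0-ml";

(: Spawn one existence-check task per batch of database URIs. :)
let $uris := cts:uris()                    (: requires the URI lexicon :)
let $batch-size := 500
for $i in 1 to xs:integer(fn:ceiling(fn:count($uris) div $batch-size))
let $batch := fn:subsequence($uris, ($i - 1) * $batch-size + 1, $batch-size)
return
  xdmp:spawn-function(
    function() {
      for $uri in $batch
      (: assumes the document URI is also its filesystem path :)
      let $exists := xdmp:filesystem-file-exists($uri)
      return
        xdmp:document-add-properties($uri,
          <validated exists="{$exists}">{fn:current-dateTime()}</validated>)
    },
    <options xmlns="xdmp:eval"><update>true</update></options>
  )

Afterwards the properties can be queried to report on anything that has disappeared from the file system.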
