That's good stuff! Thanks! It'll take me a while to digest it, but I appreciate the info.
Eric Ladner
Systems Analyst
[email protected]

From: [email protected] On Behalf Of Sam Mefford
Sent: August 22, 2017 16:18
To: MarkLogic Developer Discussion <[email protected]>
Subject: [**EXTERNAL**] Re: [MarkLogic Dev General] Large job processing question.

We generally write external applications for long-running jobs. Java is a popular language for such jobs, and our Data Movement SDK<https://developer.marklogic.com/learn/data-movement-sdk> (DMSDK) is written specifically to support them. You'd still need to write the Java code to traverse the filesystem, but DMSDK will help with writing lots of documents or with traversing (or transforming) all documents matching a query.

Also, I don't know how helpful this is, but I have a write-up of a design pattern that seems related. Feedback is appreciated.

Incremental Load Sidecar Pattern<https://wiki.marklogic.com/display/ENGINEERING/Incremental+Load+Sidecar+Pattern>

Background
Data integration involves combining data into a target system from multiple data sources for applications which require a unified view of the data. In most cases the source data continues to grow and evolve, so updates from the sources must be regularly incorporated into the target system. These updates include new documents, updated documents, and deleted documents.

Scenario
In many cases the source data is too large to ingest completely every time. This pattern addresses the more difficult scenario where incremental loads must include only the updates. Additionally, many source systems offer no way to track deleted documents. This pattern also addresses the more difficult scenario where the source system can provide a list of all current document uris but cannot provide any information about deleted documents.

Code
This solution is captured in an example for a JDBC datasource in the Java Client API Cookbook example IncrementalLoadFromJdbc<https://github.com/marklogic/java-client-api/blob/develop/src/main/java/com/marklogic/client/example/cookbook/datamovement/IncrementalLoadFromJdbc.java>.

Solution Step 1
The source documents are read by a Java application directly from the source system, and a hashcode is generated in memory for each document. The application then gets from MarkLogic Server (the target system) the hashcode for any documents already in MarkLogic. Documents are not updated if they are already in MarkLogic with a hashcode matching the one generated from the source document content; however, their sidecar document is (see below). When a document is found to be missing from MarkLogic, or different in MarkLogic (the hashcodes don't match), the source document is written to MarkLogic Server. This is all done in batches to reduce overhead on the application, source, and target systems. In addition, batches are processed in the application in multiple threads and against multiple MarkLogic hosts to fully utilize the MarkLogic cluster. The simplest way to do this is with the Data Movement SDK<https://developer.marklogic.com/learn/data-movement-sdk>, as in the sketch below.
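To make Step 1 concrete, here's a minimal sketch of the batched-write side with DMSDK. The host, port, credentials, collection name, and sample document are placeholders, and the batch size and thread count are illustrative rather than recommendations; deciding which documents actually changed (the hashcode comparison) is sketched under Steps 2 and 3 below.

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.WriteBatcher;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.io.StringHandle;

public class Step1WriteSketch {
  public static void main(String[] args) {
    DatabaseClient client = DatabaseClientFactory.newClient(
        "localhost", 8000, new DigestAuthContext("admin", "admin"));
    DataMovementManager dmm = client.newDataMovementManager();

    // The WriteBatcher groups documents into batches, writes them from multiple
    // threads, and spreads the batches across the hosts in the cluster.
    WriteBatcher writer = dmm.newWriteBatcher()
        .withBatchSize(100)
        .withThreadCount(8)
        .onBatchSuccess(batch ->
            System.out.println("wrote a batch of " + batch.getItems().length + " documents"))
        .onBatchFailure((batch, throwable) -> throwable.printStackTrace());
    dmm.startJob(writer);

    // Stand-in for a document read from the source whose hashcode was missing
    // from MarkLogic or did not match the stored hashcode.
    DocumentMetadataHandle meta = new DocumentMetadataHandle().withCollections("source-a");
    writer.add("/source-a/doc1.json", meta,
        new StringHandle("{\"msg\":\"new or changed content\"}"));

    writer.flushAndWait();   // block until every queued batch has been written
    dmm.stopJob(writer);
    client.release();
  }
}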
Step 2
For any document written to MarkLogic Server, a "sidecar" document is also written containing metadata including the document uri, a hashcode, and a jobName. The sidecar document has a collection representing the data source. The hashcode is generated from the source document contents using any hash algorithm<https://en.wikipedia.org/wiki/Cryptographic_hash_function> that stays the same when the source document hasn't changed and differs any time the source document has changed. The jobName is any id or timestamp representing the last job which validated the document, and should differ from previous job runs. This sidecar document is updated with each job run to reflect the latest jobName.

Step 3
As the last step of a job run, a query returns all sidecar documents with the collection for this datasource but a jobName different from the current jobName, which indicates these documents are in MarkLogic but were missing from this job run and are therefore no longer in the datasource. After confirming that these documents are legitimately not in the datasource, they are deleted from MarkLogic Server. This is how we stay up to date with deletes when the source system offers no way to track deleted documents. (A sketch of Steps 2 and 3 follows below.)
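To make Steps 2 and 3 concrete, here are two minimal sketches. They assume SHA-256 as the hash algorithm, a JSON sidecar layout of my own choosing (fields uri, hashcode, and jobName), and a collection name of "sidecar-source-a" standing in for the data source; URIs, connection details, and the jobName value are placeholders, not part of the pattern itself.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.io.StringHandle;

public class SidecarWriteSketch {
  // Step 2: any algorithm works as long as the value is stable while the content
  // is unchanged and different whenever the content changes; SHA-256 is one choice.
  static String hashcode(String content) throws Exception {
    byte[] digest = MessageDigest.getInstance("SHA-256")
        .digest(content.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) hex.append(String.format("%02x", b));
    return hex.toString();
  }

  public static void main(String[] args) throws Exception {
    DatabaseClient client = DatabaseClientFactory.newClient(
        "localhost", 8000, new DigestAuthContext("admin", "admin"));
    JSONDocumentManager docMgr = client.newJSONDocumentManager();

    String docUri  = "/source-a/doc1.json";
    String content = "{\"msg\":\"content read from the source system\"}";
    String jobName = "2017-08-22T16:18:00";   // any id/timestamp unique to this run

    // The sidecar records the uri, the hashcode, and the jobName of the run
    // that last saw the document; its collection identifies the data source.
    String sidecar = String.format(
        "{\"uri\":\"%s\", \"hashcode\":\"%s\", \"jobName\":\"%s\"}",
        docUri, hashcode(content), jobName);
    DocumentMetadataHandle meta = new DocumentMetadataHandle()
        .withCollections("sidecar-source-a");
    docMgr.write("/sidecars" + docUri, meta, new StringHandle(sidecar));

    client.release();
  }
}

And the Step 3 cleanup: query for sidecars in this source's collection whose jobName is not the current run's jobName, then delete them, here with DMSDK's QueryBatcher and DeleteListener. (In a full implementation the matching content documents would be confirmed and deleted as well.)

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.DeleteListener;
import com.marklogic.client.datamovement.QueryBatcher;
import com.marklogic.client.query.QueryManager;
import com.marklogic.client.query.StructuredQueryBuilder;
import com.marklogic.client.query.StructuredQueryDefinition;

public class StaleSidecarCleanupSketch {
  public static void main(String[] args) {
    String currentJobName = "2017-08-22T16:18:00";

    DatabaseClient client = DatabaseClientFactory.newClient(
        "localhost", 8000, new DigestAuthContext("admin", "admin"));
    DataMovementManager dmm = client.newDataMovementManager();
    QueryManager queryMgr = client.newQueryManager();
    StructuredQueryBuilder qb = queryMgr.newStructuredQueryBuilder();

    // Sidecars for this source that were NOT touched by the current run:
    // their source documents are gone from the datasource.
    StructuredQueryDefinition staleSidecars = qb.and(
        qb.collection("sidecar-source-a"),
        qb.not(qb.value(qb.jsonProperty("jobName"), currentJobName)));

    QueryBatcher cleanup = dmm.newQueryBatcher(staleSidecars)
        .withBatchSize(100)
        .withThreadCount(4)
        .withConsistentSnapshot()   // so URIs aren't skipped as deletes shrink the result set
        .onUrisReady(new DeleteListener())
        .onQueryFailure(throwable -> throwable.printStackTrace());
    dmm.startJob(cleanup);
    cleanup.awaitCompletion();
    dmm.stopJob(cleanup);
    client.release();
  }
}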
Solution Alternative
If your scenario allows you to load all the documents each time, do that, because it's simpler. Simply delete in the target all data from that one source, then reload the latest data from that source. This addresses new documents, updated documents, and deleted documents.

Solution Adjustment 1
The sidecar documents can be written to a different MarkLogic database, a different cluster, or a non-MarkLogic system (including the file system). This reduces the read load on the database holding the actual document contents. It also opens more options, such as writing the sidecar documents to a database with a different configuration, including forests on less expensive storage.

Solution Adjustment 2
For systems that offer a way to track deleted documents, use this instead of Step 3: get the list of uris of source documents deleted since the last job run, then delete those documents (and their associated sidecar documents) from MarkLogic Server.

Solution Adjustment 3
The source documents can be read from a staging area containing at least the uri and the up-to-date hashcode for each document. This reduces the read load on the source system to only the documents found to be missing from MarkLogic or changed from what is in MarkLogic.

Sam Mefford
Senior Engineer
MarkLogic Corporation
[email protected]
Cell: +1 801 706 9731
www.marklogic.com

This e-mail and any accompanying attachments are confidential. The information is intended solely for the use of the individual to whom it is addressed. Any review, disclosure, copying, distribution, or use of this e-mail communication by others is strictly prohibited. If you are not the intended recipient, please notify us immediately by returning this message to the sender and delete all copies. Thank you for your cooperation.

________________________________
From: [email protected] on behalf of Ladner, Eric (Eric.Ladner) [[email protected]]
Sent: Tuesday, August 22, 2017 8:36 AM
To: [email protected]
Subject: [MarkLogic Dev General] Large job processing question.

We have some large jobs (ingestion and validation of unstructured documents) that have timeout issues. The way the jobs are structured is that the first job checks that all the existing documents are valid (i.e., still exist on the file system). It does this in two steps: 1) gather all documents to be validated from the DB, and 2) check that list against the file system.

The second job is: 1) the filesystem is traversed to find any new documents (or documents that have been modified in the last X days), and 2) those new/modified documents are ingested. The problem with the second job is that there could be tens of thousands of documents in a hundred thousand folders (don't ask). The job will typically time out after an hour during the "go find all the new documents" phase.

I'm trying to find out if there's a way to re-structure the job so that it runs faster and doesn't time out, or maybe break the task up into different parts that run in parallel or something. Any thoughts welcome.

Eric Ladner
Systems Analyst
[email protected]
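Since the timeouts hit during the "go find all the new documents" phase, here is a minimal sketch of doing that traversal in an external Java application, per the suggestion above, so it never runs inside a MarkLogic request and can't hit a server-side timeout; the root folder and the 30-day cutoff are placeholders.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FindNewOrModified {
  public static void main(String[] args) throws IOException {
    Path root = Paths.get("/data/docs");                        // placeholder root folder
    Instant cutoff = Instant.now().minus(30, ChronoUnit.DAYS);  // "modified in the last X days"

    List<Path> changed;
    try (Stream<Path> paths = Files.walk(root)) {
      changed = paths.parallel()                 // filter across cores; the walk is a single pass
          .filter(Files::isRegularFile)
          .filter(p -> {
            try {
              return Files.getLastModifiedTime(p).toInstant().isAfter(cutoff);
            } catch (IOException e) {
              return false;                      // skip unreadable entries rather than failing
            }
          })
          .collect(Collectors.toList());
    }

    // These paths would then be handed to a WriteBatcher (see the Step 1 sketch above)
    // for ingestion, in batches and in parallel.
    changed.forEach(System.out::println);
  }
}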
_______________________________________________
General mailing list
[email protected]
Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
