That's good stuff! Thanks! It'll take me a while to digest it, but I appreciate the info.
Eric Ladner
Systems Analyst
[email protected]

From: [email protected] On Behalf Of Sam Mefford
Sent: August 22, 2017 16:18
To: MarkLogic Developer Discussion <[email protected]>
Subject: [**EXTERNAL**] Re: [MarkLogic Dev General] Large job processing question.

We generally write external applications for long-running jobs. Java is a popular language for such jobs, and our Data Movement SDK<https://developer.marklogic.com/learn/data-movement-sdk> (DMSDK) is written specifically to support them. You'd still need to write the Java code to traverse the filesystem, but DMSDK will help with writing lots of documents or with traversing (or transforming) all documents matching a query.

Also, I don't know how helpful this is, but I have a write-up of a design pattern that seems related. Feedback is appreciated.

Incremental Load Sidecar Pattern<https://wiki.marklogic.com/display/ENGINEERING/Incremental+Load+Sidecar+Pattern>

Background
Data integration involves combining data into a target system from multiple data sources for applications which require a unified view of the data. In most cases the source data continues to grow and evolve, so updates from the sources must be regularly incorporated into the target system. These updates include new documents, updated documents, and deleted documents.

Scenario
In many cases the source data is too large to ingest completely every time. This pattern addresses the more difficult scenario where incremental loads must include only the updates. Additionally, many source systems offer no way to track deleted documents. This pattern also addresses the more difficult scenario where the source system can provide a list of all current document uris but cannot provide any information about deleted documents.

Code
This solution is captured in an example for a JDBC datasource in the Java Client API Cookbook example IncrementalLoadFromJdbc<https://github.com/marklogic/java-client-api/blob/develop/src/main/java/com/marklogic/client/example/cookbook/datamovement/IncrementalLoadFromJdbc.java>.

Solution Step 1
The source documents are read by a Java application directly from the source system, and a hashcode is generated in memory for each document. The application then gets from MarkLogic Server (the target system) the hashcode for any documents already in MarkLogic. Documents are not updated if they are already in MarkLogic with a hashcode matching the one generated from the source document content; however, their sidecar document is (see below). When a document is found to be missing from MarkLogic, or different in MarkLogic (the hashcodes don't match), the source document is written to MarkLogic Server. This is all done in batches to reduce overhead on the application, source, and target systems. In addition, batches are processed in the application in multiple threads and against multiple MarkLogic hosts to fully utilize the MarkLogic cluster. The simplest way to do this is with the Data Movement SDK<https://developer.marklogic.com/learn/data-movement-sdk>, as in the sketch below.
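To make Step 1 concrete, here's a minimal sketch of the batched-write side with DMSDK. The host, port, credentials, collection name, and sample document are placeholders, and the batch size and thread count are illustrative rather than recommendations; deciding which documents actually changed (the hashcode comparison) is sketched under Steps 2 and 3 below.

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.WriteBatcher;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.io.StringHandle;

public class Step1WriteSketch {
  public static void main(String[] args) {
    DatabaseClient client = DatabaseClientFactory.newClient(
        "localhost", 8000, new DigestAuthContext("admin", "admin"));
    DataMovementManager dmm = client.newDataMovementManager();

    // The WriteBatcher groups documents into batches, writes them from multiple
    // threads, and spreads the batches across the hosts in the cluster.
    WriteBatcher writer = dmm.newWriteBatcher()
        .withBatchSize(100)
        .withThreadCount(8)
        .onBatchSuccess(batch ->
            System.out.println("wrote a batch of " + batch.getItems().length + " documents"))
        .onBatchFailure((batch, throwable) -> throwable.printStackTrace());
    dmm.startJob(writer);

    // Stand-in for a document read from the source whose hashcode was missing
    // from MarkLogic or did not match the stored hashcode.
    DocumentMetadataHandle meta = new DocumentMetadataHandle().withCollections("source-a");
    writer.add("/source-a/doc1.json", meta,
        new StringHandle("{\"msg\":\"new or changed content\"}"));

    writer.flushAndWait();   // block until every queued batch has been written
    dmm.stopJob(writer);
    client.release();
  }
}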
Step 2
For any document written to MarkLogic Server, a "sidecar" document is also written containing metadata including the document uri, a hashcode, and a jobName. The sidecar document has a collection representing the data source. The hashcode is generated from the source document contents using any hash algorithm<https://en.wikipedia.org/wiki/Cryptographic_hash_function> that stays the same when the source document hasn't changed and differs any time the source document has changed. The jobName is any id or timestamp representing the last job which validated the document, and should differ from previous job runs. This sidecar document is updated with each job run to reflect the latest jobName.

Step 3
As the last step of a job run, a query returns all sidecar documents with the collection for this datasource but a jobName different from the current jobName, which indicates these documents are in MarkLogic but were missing from this job run and are therefore no longer in the datasource. After confirming that these documents are legitimately not in the datasource, they are deleted from MarkLogic Server. This is how we stay up to date with deletes when the source system offers no way to track deleted documents. (A sketch of Steps 2 and 3 follows below.)
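To make Steps 2 and 3 concrete, here are two minimal sketches. They assume SHA-256 as the hash algorithm, a JSON sidecar layout of my own choosing (fields uri, hashcode, and jobName), and a collection name of "sidecar-source-a" standing in for the data source; URIs, connection details, and the jobName value are placeholders, not part of the pattern itself.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.io.StringHandle;

public class SidecarWriteSketch {
  // Step 2: any algorithm works as long as the value is stable while the content
  // is unchanged and different whenever the content changes; SHA-256 is one choice.
  static String hashcode(String content) throws Exception {
    byte[] digest = MessageDigest.getInstance("SHA-256")
        .digest(content.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) hex.append(String.format("%02x", b));
    return hex.toString();
  }

  public static void main(String[] args) throws Exception {
    DatabaseClient client = DatabaseClientFactory.newClient(
        "localhost", 8000, new DigestAuthContext("admin", "admin"));
    JSONDocumentManager docMgr = client.newJSONDocumentManager();

    String docUri  = "/source-a/doc1.json";
    String content = "{\"msg\":\"content read from the source system\"}";
    String jobName = "2017-08-22T16:18:00";   // any id/timestamp unique to this run

    // The sidecar records the uri, the hashcode, and the jobName of the run
    // that last saw the document; its collection identifies the data source.
    String sidecar = String.format(
        "{\"uri\":\"%s\", \"hashcode\":\"%s\", \"jobName\":\"%s\"}",
        docUri, hashcode(content), jobName);
    DocumentMetadataHandle meta = new DocumentMetadataHandle()
        .withCollections("sidecar-source-a");
    docMgr.write("/sidecars" + docUri, meta, new StringHandle(sidecar));

    client.release();
  }
}

And the Step 3 cleanup: query for sidecars in this source's collection whose jobName is not the current run's jobName, then delete them, here with DMSDK's QueryBatcher and DeleteListener. (In a full implementation the matching content documents would be confirmed and deleted as well.)

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.DeleteListener;
import com.marklogic.client.datamovement.QueryBatcher;
import com.marklogic.client.query.QueryManager;
import com.marklogic.client.query.StructuredQueryBuilder;
import com.marklogic.client.query.StructuredQueryDefinition;

public class StaleSidecarCleanupSketch {
  public static void main(String[] args) {
    String currentJobName = "2017-08-22T16:18:00";

    DatabaseClient client = DatabaseClientFactory.newClient(
        "localhost", 8000, new DigestAuthContext("admin", "admin"));
    DataMovementManager dmm = client.newDataMovementManager();
    QueryManager queryMgr = client.newQueryManager();
    StructuredQueryBuilder qb = queryMgr.newStructuredQueryBuilder();

    // Sidecars for this source that were NOT touched by the current run:
    // their source documents are gone from the datasource.
    StructuredQueryDefinition staleSidecars = qb.and(
        qb.collection("sidecar-source-a"),
        qb.not(qb.value(qb.jsonProperty("jobName"), currentJobName)));

    QueryBatcher cleanup = dmm.newQueryBatcher(staleSidecars)
        .withBatchSize(100)
        .withThreadCount(4)
        .withConsistentSnapshot()   // so URIs aren't skipped as deletes shrink the result set
        .onUrisReady(new DeleteListener())
        .onQueryFailure(throwable -> throwable.printStackTrace());
    dmm.startJob(cleanup);
    cleanup.awaitCompletion();
    dmm.stopJob(cleanup);
    client.release();
  }
}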
Solution Alternative
If your scenario allows you to load all the documents each time, do that, because it's simpler. Simply delete in the target all data from that one source, then reload the latest data from that source. This addresses new documents, updated documents, and deleted documents.

Solution Adjustment 1
The sidecar documents can be written to a different MarkLogic database, a different cluster, or a non-MarkLogic system (including the file system). This reduces the read load on the database holding the actual document contents. It also opens more options, such as writing the sidecar documents to a database with a different configuration, including forests on less expensive storage.

Solution Adjustment 2
For systems that offer a way to track deleted documents, use this instead of Step 3: get the list of uris of source documents deleted since the last job run, then delete those documents (and their associated sidecar documents) from MarkLogic Server.

Solution Adjustment 3
The source documents can be read from a staging area containing at least the uri and the up-to-date hashcode for each document. This reduces the read load on the source system to only the documents found to be missing from MarkLogic or changed from what is in MarkLogic.

Sam Mefford
Senior Engineer
MarkLogic Corporation
[email protected]
Cell: +1 801 706 9731
www.marklogic.com

This e-mail and any accompanying attachments are confidential. The information is intended solely for the use of the individual to whom it is addressed. Any review, disclosure, copying, distribution, or use of this e-mail communication by others is strictly prohibited. If you are not the intended recipient, please notify us immediately by returning this message to the sender and delete all copies. Thank you for your cooperation.

________________________________
From: [email protected] on behalf of Ladner, Eric (Eric.Ladner) [[email protected]]
Sent: Tuesday, August 22, 2017 8:36 AM
To: [email protected]
Subject: [MarkLogic Dev General] Large job processing question.

We have some large jobs (ingestion and validation of unstructured documents) that have timeout issues. The way the jobs are structured is that the first job checks that all the existing documents are valid (i.e., still exist on the file system). It does this in two steps: 1) gather all documents to be validated from the DB, and 2) check that list against the file system.

The second job is: 1) the filesystem is traversed to find any new documents (or documents that have been modified in the last X days), and 2) those new/modified documents are ingested. The problem with the second job is that there could be tens of thousands of documents in a hundred thousand folders (don't ask). The job will typically time out after an hour during the "go find all the new documents" phase.

I'm trying to find out if there's a way to re-structure the job so that it runs faster and doesn't time out, or maybe break the task up into different parts that run in parallel or something. Any thoughts welcome.

Eric Ladner
Systems Analyst
[email protected]
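Since the timeouts hit during the "go find all the new documents" phase, here is a minimal sketch of doing that traversal in an external Java application, per the suggestion above, so it never runs inside a MarkLogic request and can't hit a server-side timeout; the root folder and the 30-day cutoff are placeholders.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FindNewOrModified {
  public static void main(String[] args) throws IOException {
    Path root = Paths.get("/data/docs");                        // placeholder root folder
    Instant cutoff = Instant.now().minus(30, ChronoUnit.DAYS);  // "modified in the last X days"

    List<Path> changed;
    try (Stream<Path> paths = Files.walk(root)) {
      changed = paths.parallel()                 // filter across cores; the walk is a single pass
          .filter(Files::isRegularFile)
          .filter(p -> {
            try {
              return Files.getLastModifiedTime(p).toInstant().isAfter(cutoff);
            } catch (IOException e) {
              return false;                      // skip unreadable entries rather than failing
            }
          })
          .collect(Collectors.toList());
    }

    // These paths would then be handed to a WriteBatcher (see the Step 1 sketch above)
    // for ingestion, in batches and in parallel.
    changed.forEach(System.out::println);
  }
}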
_______________________________________________
General mailing list
[email protected]
Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
