About the only thing I can suggest that would work within the ManifoldCF
framework would be to structure your jobs so that most runs are "Minimal"
runs, with a "Complete" run done every 24 hours.  A minimal run should pick
up documents that have been changed or added, but it will not go through
the process of finding documents that need to be deleted, so it runs much
faster.
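To make the suggestion concrete, here is a small sketch of that schedule. ManifoldCF's REST API exposes per-job "start" and "startminimal" commands; the base URL, job id, and the choice of 02:00 for the complete run below are illustrative assumptions, not part of the original advice:

```python
# Sketch: mostly "minimal" runs, with one "complete" run per 24 hours.
# BASE_URL and the job id format are assumptions for illustration; adjust
# to your actual ManifoldCF API service deployment.
BASE_URL = "http://localhost:8345/mcf-api-service/json"

def run_command(hour: int, complete_hour: int = 2) -> str:
    """Pick the API command for this run: a full crawl (with deletion
    detection) at complete_hour, a faster minimal crawl otherwise."""
    return "start" if hour == complete_hour else "startminimal"

def job_url(job_id: str, hour: int) -> str:
    # A PUT to this URL (no body) would kick off the job run.
    return f"{BASE_URL}/{run_command(hour)}/{job_id}"

print(job_url("1234567890", hour=10))
```

Whether you drive this from the job's built-in schedule tabs or from an external cron script hitting the API is a deployment choice; the point is simply that only one run per day pays the cost of deletion detection.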

Karl


On Tue, Jul 2, 2019 at 11:22 AM R. <[email protected]> wrote:

> Hello,
>
> we are ingesting a Documentum system using ManifoldCF and indexing those
> documents into Elasticsearch. The Documentum system consists of a few
> hundred cabinets containing, in total, a few million documents. We have
> defined about 80 ManifoldCF jobs, and each job processes some portion of
> the cabinets. The jobs are scheduled to run once a day to pick up
> new/updated content. This setup works pretty well for us, but we have now
> received a specific request for almost 'real-time' ingestion and indexing.
>
> The 'real-time' request does not apply to all documents, only to a small
> subset of them. They are not stored in one place but are spread across all
> cabinets, and we cannot identify their location in advance to create a
> specific job for them. Our idea for a solution is to call a DQL query that
> gives us the IDs of those documents and to process only them, calling the
> query frequently, e.g. every 5 minutes.
>
> Can this somehow be done with ManifoldCF? Is there a way to pass only the
> IDs of the documents I want to ingest into the ManifoldCF Documentum
> connector? Would a solution be to derive a 'real-time' connector from the
> Documentum connector, where this connector computes the documents for
> 'real-time' ingestion using a DQL query and ingests them? Can a job be
> scheduled to run frequently, e.g. at 5-minute intervals?
>
> Thanks for any advice or suggestion,
> Radko
>
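The DQL idea from the question above could be sketched as follows. This only builds the query string; the `a_special_app` attribute used to flag the 'real-time' subset is a hypothetical placeholder (substitute whatever marks that subset in your repository), and the date format pattern should be checked against your Documentum server's DQL settings:

```python
# Sketch of the DQL query described above: return the IDs of 'real-time'
# documents modified since the last poll.  The flag attribute and the
# date format pattern are assumptions for illustration.
def build_realtime_dql(since: str, flag_attr: str = "a_special_app") -> str:
    """Build a DQL query selecting recently modified 'real-time' docs."""
    return (
        "SELECT r_object_id FROM dm_document "
        f"WHERE {flag_attr} = 'realtime' "
        f"AND r_modify_date > DATE('{since}', 'yyyy-mm-dd hh:mi:ss')"
    )

print(build_realtime_dql("2019-07-02 11:00:00"))
```

A polling script would run this every 5 minutes, remember the timestamp of the last successful poll, and push the resulting IDs to whatever ingestion path is chosen.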
