RE: Delta deletion

julien.massiera Mon, 02 Mar 2020 07:41:08 -0800

Hi Karl,

Thanks for your answer.

Your explanation confirms what I expected about the way MCF currently 
implements its models. As you stated, this means that in order to use a 
_DELETE model properly, the seeding process has to provide the complete list of 
deleted documents. 

Yet wouldn't it be a useful improvement to update the activities.deleteDocument 
method (or to add a separate delete method) so that it automatically, and 
optionally, removes the documents referenced by a given document ID?

For instance, since the activities.addDocumentReference method already takes 
the identifier of the "parent" document, couldn't we maintain in Postgres a 
list of "child IDs" and use it during the delete process to delete them?
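
To make the idea concrete, here is a minimal sketch of such a parent-to-children 
index with a cascading delete. ChildReferenceIndex, addReference and 
deleteWithDescendants are hypothetical names used only for illustration; they 
are not part of the ManifoldCF API. The sketch keeps the mapping in memory, 
whereas the proposal above would persist it in Postgres, and it assumes each 
child is referenced by a single parent.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical parent -> children index (NOT a ManifoldCF class).
 * Simplification: each child is assumed to have exactly one parent.
 */
final class ChildReferenceIndex {
  // Children recorded for each parent, in discovery order.
  private final Map<String, Set<String>> childrenByParent = new HashMap<>();

  /** Records that childId was referenced from parentId. */
  void addReference(String parentId, String childId) {
    childrenByParent
        .computeIfAbsent(parentId, k -> new LinkedHashSet<>())
        .add(childId);
  }

  /**
   * Removes documentId and all of its recorded descendants from the index.
   * Returns the identifiers to delete, starting with documentId itself.
   */
  List<String> deleteWithDescendants(String documentId) {
    List<String> toDelete = new ArrayList<>();
    Set<String> seen = new HashSet<>();      // guards against cycles
    Deque<String> pending = new ArrayDeque<>();
    pending.add(documentId);
    while (!pending.isEmpty()) {
      String current = pending.poll();
      if (!seen.add(current)) {
        continue;                            // already handled
      }
      toDelete.add(current);
      Set<String> children = childrenByParent.remove(current);
      if (children != null) {
        pending.addAll(children);            // breadth-first traversal
      }
    }
    return toDelete;
  }
}
```

With this in place, a delete call for one parent identifier yields the parent 
plus every child recorded for it, without asking the repository about the 
children.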

This would be very useful in the use case I already described, and I am sure 
it would also be useful for other types of connectors and/or future 
connectors. The benefits of such a modification increase with the number of 
crawled documents. 

Here is an illustration of the benefits of this MCF modification:

With my current connector, if my first crawl ingests 1M documents and, on the 
delta crawl, only 1 document that has 2 children is deleted, the connector must 
rely on the processDocuments method to check the version of each of the 1M 
documents in order to find and delete the 3 affected ones (so at least 1M calls 
to the API of the targeted repository). With the suggested optional 
modification, the seeding process would use the delta API of the targeted 
repository and declare the parent document (only one API call), then the 
processDocuments method would be triggered only once to check the version of 
the document (one more API call), find that it no longer exists and delete it 
along with its 2 children. That is 2 API calls vs. 1M. Even if, on the 
framework side, we have one more request to make to Postgres, I think it is 
worth the processing time.
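
The comparison above can be written down as a small back-of-the-envelope 
calculation. DeltaCrawlCost and its methods are hypothetical helpers for this 
illustration only, and they count repository API calls under the assumptions 
stated in the example (one version check per document, one delta call per 
crawl).

```java
/** Hypothetical cost model for the example above (NOT a ManifoldCF class). */
final class DeltaCrawlCost {
  /** Without cascading deletes: every known document is version-checked. */
  static long callsWithFullRecheck(long knownDocuments) {
    return knownDocuments;
  }

  /** With cascading deletes: one delta call, then one check per deleted parent. */
  static long callsWithCascade(long deletedParents) {
    return 1 + deletedParents;
  }
}
```

For the figures in the example, callsWithFullRecheck(1_000_000L) gives 1M calls 
while callsWithCascade(1L) gives 2.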

What do you think?

Julien

-----Original Message-----
From: Karl Wright <daddy...@gmail.com> 
Sent: Saturday, February 29, 2020 15:51
To: dev <dev@manifoldcf.apache.org>
Subject: Re: Delta deletion

Hi Julien,

First, ALL models rely on individual existence checks for documents.  That is, 
when your connector fetches a deleted document, the framework has to be told 
that the document is gone, or it will not be removed.  There is no "discovery" 
process for deleted documents other than seeding (and only when the model 
includes _DELETE).

The upshot of this is that IF your seeding method does not return all documents 
that have been removed THEN it cannot be a _DELETE model.

I hope this helps.

Karl

On Sat, Feb 29, 2020 at 8:10 AM <julien.massi...@francelabs.com> wrote:

> Hi dev community,
>
>
>
> I am trying to develop a connector for an API that exposes a 
> hierarchical tree of documents: each document can have child 
> documents.
>
> During the initial crawl, the child documents are referenced in the MCF 
> connector through the method 
> activities.addDocumentReference(childDocumentIdentifier,
> parentDocumentIdentifier, parentDataNames, parentDataValues)
>
> The API is able to provide delta modifications/deletions since a 
> given date but, when a document that has children is deleted, the 
> API only returns the id of the document, not those of its children. On 
> the MCF connector side, I thought that, as I have referenced the 
> children, deleting the parent document would delete all its children 
> with it, but it appears that this is not the case.
>
> So my question is: did I miss something? Is there another way to 
> perform delta deletions? Unfortunately, if I don't find a way to solve 
> this issue, I will not be able to take advantage of the delta feature 
> and thus I will have to use the "add_modify" connector type and test 
> every id on a delta crawl to figure out which ids are missing. This 
> would be a huge loss of performance.
>
>
>
> Regards,
>
> Julien Massiera
>
>
