Hi Karl,

I tried to use the carrydown mechanism to delete child documents, but I am facing a problem:
During the first crawl, the connector registers the child documents of a document as carrydown data in the processDocuments method, through the activities.addDocumentReference method, and all is working well. During a delta crawl, the addSeedDocuments method declares deleted parent documents, but in processDocuments, although I am able to retrieve the child document ids of the parent document thanks to the carrydown data, I am unable to delete them.

My guess is that the ids I want to delete have not been declared in the addSeedDocuments method. If this is correct, is there any way to change this behavior? Otherwise, is there another way to do things? As I cannot retrieve carrydown data in addSeedDocuments, I seem to be at a dead end.

Julien

-----Original Message-----
From: Julien Massiera <julien.massi...@francelabs.com>
Sent: Monday, March 9, 2020 14:24
To: dev@manifoldcf.apache.org
Subject: Re: Delta deletion

Yes, I consider the confluence connector complete.

As you suggest, I will try to use the "carrydown" mechanism to do what I want.

Thanks,
Julien

On 09/03/2020 13:59, Karl Wright wrote:
> Do you consider the confluence connector in the branch complete?
> If so I'll look at it as time permits later today.
>
> As far as your proposal is concerned, maintaining lists of
> dependencies for all documents is quite expensive. We do this for hop
> counting and we basically tell people to only use it if they must,
> because of the huge amount of database overhead involved. We also
> maintain "carrydown" data which is accessible during document
> processing. It is typically used for ingestion, but maybe you could
> use that for a signal that child documents should delete themselves or
> something.
>
> Major crawling model changes are a gigantic effort; there are always
> many things to consider and many problems encountered that need to be
> worked around.
> If you are concerned simply with the load on your API
> to handle deletions, I'd suggest using one of the existing mechanisms
> for reducing that. But I can see no straightforward way to
> incrementally add dependency deletion to the current framework.
>
> Karl
>
>
> On Mon, Mar 9, 2020 at 5:53 AM Julien Massiera <
> julien.massi...@francelabs.com> wrote:
>
>> Hi Karl,
>>
>> Now that I have finished the confluence connector, I am getting back to
>> the other one I was working on, and it would greatly help me to have
>> your thoughts on my proposal below.
>>
>> Thanks,
>> Julien
>>
>> On 02/03/2020 16:40, julien.massi...@francelabs.com wrote:
>>> Hi Karl,
>>>
>>> Thanks for your answer.
>>>
>>> Your explanations confirm what I was anticipating about the way MCF
>>> currently implements its model. As you stated, this does mean that
>>> in order to use the _DELETE model properly, the seeding process has
>>> to provide the complete list of deleted documents.
>>>
>>> Yet wouldn't it be a useful improvement to update the
>>> activities.deleteDocument method (or create an additional delete
>>> method) so that it automatically, and optionally, removes the
>>> documents referenced by a document id?
>>>
>>> For instance, since the activities.addDocumentReference method
>>> already asks for the document identifier of the "parent" document,
>>> couldn't we maintain in postgres a list of "child ids" and use it
>>> during the delete process to delete them?
>>>
>>> This is very useful in the use case I already described, but I am
>>> sure it would be useful for other types of connectors and/or future
>>> connectors. The benefits of such a modification increase with the
>>> number of crawled documents.
>>> Here is an illustration of the benefits of this MCF modification:
>>>
>>> With my current connector, if my first crawl ingests 1M documents and,
>>> on the delta crawl, only 1 document that has 2 children is deleted, it
>>> must rely on the processDocuments method to check the version of each
>>> of the 1M documents to figure out and delete the 3 concerned ones (so
>>> at least 1M calls to the API of the targeted repository). With the
>>> suggested optional modification, the seeding process would use the
>>> delta API of the targeted repository and declare the parent document
>>> (only one API call), then the processDocuments method would be
>>> triggered only once to check the version of the document (one more
>>> API call), figure out that it does not exist anymore and delete it
>>> with its 2 children. That is 2 API calls vs 1M...
>>> Even if on the framework side we have one more request to perform to
>>> postgres, I think it is worth the processing time.
>>>
>>> What do you think?
>>>
>>> Julien
>>>
>>> -----Original Message-----
>>> From: Karl Wright <daddy...@gmail.com>
>>> Sent: Saturday, February 29, 2020 15:51
>>> To: dev <dev@manifoldcf.apache.org>
>>> Subject: Re: Delta deletion
>>>
>>> Hi Julien,
>>>
>>> First, ALL models rely on individual existence checks for documents.
>>> That is, when your connector fetches a deleted document, the
>>> framework has to be told that the document is gone, or it will not be
>>> removed. There is no "discovery" process for deleted documents other
>>> than seeding (and only when the model includes _DELETE).
>>>
>>> The upshot of this is that IF your seeding method does not return all
>>> documents that have been removed THEN it cannot be a _DELETE model.
>>>
>>> I hope this helps.
>>>
>>> Karl
>>>
>>>
>>> On Sat, Feb 29, 2020 at 8:10 AM <julien.massi...@francelabs.com> wrote:
>>>
>>>> Hi dev community,
>>>>
>>>> I am trying to develop a connector for an API that exposes a
>>>> hierarchical tree of documents: each document can have child
>>>> documents.
>>>>
>>>> During the initial crawl, the child documents are referenced in the
>>>> MCF connector through the method
>>>> activities.addDocumentReference(childDocumentIdentifier,
>>>> parentDocumentIdentifier, parentDataNames, parentDataValues)
>>>>
>>>> The API is able to provide delta modifications/deletions from a
>>>> provided date but, when a document that has children is deleted,
>>>> the API only returns the id of the document, not its children. On
>>>> the MCF connector side, I thought that, as I have referenced the
>>>> children, deleting the parent document would delete all its
>>>> children with it, but it appears that this is not the case.
>>>>
>>>> So my question is: did I miss something? Is there another way to
>>>> perform delta deletions? Unfortunately, if I don't find a way to
>>>> solve this issue, I will not be able to take advantage of the delta
>>>> feature and thus I will have to use the "add_modify" connector type
>>>> and test every id on a delta crawl to figure out which ids are
>>>> missing. This would be a huge loss of performance.
>>>>
>>>> Regards,
>>>>
>>>> Julien Massiera
>>>>
>>>>
>> --
>> Julien MASSIERA
>> Director of Product Development
>> France Labs – The Search Experts
>> Datafari – Winner of the Big Data 2018 trophy at the Digital Innovation
>> Makers Summit
>> www.francelabs.com
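
The "child ids" idea proposed in the 02/03/2020 message can be sketched as a toy in-memory model. This is not ManifoldCF code and does not call the framework: the class ChildIndexSketch, its methods, and the map standing in for the postgres "child ids" table are all hypothetical, chosen only to illustrate deleting a parent together with its recorded children.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of the proposal; not part of ManifoldCF.
public class ChildIndexSketch {
    // parent id -> child ids (stand-in for the proposed "child ids" table)
    private final Map<String, List<String>> children = new HashMap<>();
    // ids currently considered indexed
    private final Set<String> indexed = new HashSet<>();

    // Record a document and, when it has a parent, the parent -> child link.
    public void addDocument(String id, String parentId) {
        indexed.add(id);
        if (parentId != null) {
            children.computeIfAbsent(parentId, k -> new ArrayList<>()).add(id);
        }
    }

    // Delete a document and every recorded descendant; returns the removed ids.
    public List<String> deleteWithChildren(String id) {
        List<String> removed = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(id);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            if (indexed.remove(current)) {
                removed.add(current);
            }
            List<String> kids = children.remove(current);
            if (kids != null) {
                for (String kid : kids) {
                    pending.push(kid);
                }
            }
        }
        return removed;
    }

    public int size() {
        return indexed.size();
    }

    public static void main(String[] args) {
        ChildIndexSketch index = new ChildIndexSketch();
        index.addDocument("parent", null);
        index.addDocument("child-1", "parent");
        index.addDocument("child-2", "parent");
        index.addDocument("other", null);

        // Only the parent id is reported as deleted, as with the delta API.
        List<String> removed = index.deleteWithChildren("parent");
        // prints "removed 3, remaining 1"
        System.out.println("removed " + removed.size() + ", remaining " + index.size());
    }
}
```

In this model the deletion needs only the parent id, which matches the two-call scenario described in the thread; the cost moves to maintaining the parent-to-child index for every document, which is the database overhead Karl warns about.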