RE: Delta deletion

julien.massiera Thu, 26 Mar 2020 08:52:39 -0700

Hi Karl,

I tried to use the carrydown mechanism to delete child documents, but I am
facing a problem:


During the first crawl, the connector registers the child documents of a
document as carrydown data in the processDocuments method, through the
activities.addDocumentReference method, and all works well.
During a delta crawl, the addSeedDocuments method declares the deleted parent
documents. In processDocuments, however, although I am able to retrieve the
child document ids of the parent document thanks to the carrydown data, I am
unable to delete them. My guess is that this is because the ids I want to
delete have not been declared in the addSeedDocuments method. If this is
correct, is there any way to change this behavior?
Otherwise, is there another way to do this? As I cannot retrieve carrydown
data in addSeedDocuments, I seem to be at a dead end.
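
For illustration only, the expansion step described above (turning a deleted
parent id into the full list of ids to delete) can be sketched outside
ManifoldCF. The class ChildReferenceIndex and its methods are hypothetical
names, not part of the ManifoldCF API, and the sketch does not show how
carrydown data is actually stored. Whether a connector can have such an index
available at seeding time is exactly the open question in this message.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory parent -> children store; NOT ManifoldCF code. */
class ChildReferenceIndex {
    private final Map<String, List<String>> childrenByParent = new HashMap<>();

    /** Records that childId was discovered while processing parentId. */
    void addReference(String childId, String parentId) {
        childrenByParent.computeIfAbsent(parentId, k -> new ArrayList<>()).add(childId);
    }

    /** Returns the direct children recorded for parentId (empty if none). */
    List<String> childrenOf(String parentId) {
        return childrenByParent.getOrDefault(parentId, Collections.emptyList());
    }

    /**
     * Expands deleted parent ids into every id that should be deleted:
     * each parent followed by all of its recorded descendants.
     */
    Set<String> expandDeletions(List<String> deletedParentIds) {
        Set<String> toDelete = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>(deletedParentIds);
        while (!pending.isEmpty()) {
            String id = pending.pop();
            // add() returns false for ids already seen, which guards against cycles
            if (toDelete.add(id)) {
                pending.addAll(childrenOf(id));
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        ChildReferenceIndex index = new ChildReferenceIndex();
        index.addReference("child-a", "parent-1");
        index.addReference("child-b", "parent-1");
        // The delta API reports only the parent; the index supplies the children.
        System.out.println(index.expandDeletions(Arrays.asList("parent-1")));
        // prints [parent-1, child-a, child-b]
    }
}
```

In this sketch the expansion needs no call to the repository API; it only
reads what was recorded during the first crawl.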

Julien


-----Original Message-----
From: Julien Massiera <julien.massi...@francelabs.com>
Sent: Monday, March 9, 2020 14:24
To: dev@manifoldcf.apache.org
Subject: Re: Delta deletion

Yes, I consider the Confluence connector complete.

As you suggest, I will try to use the "carrydown" mechanism to do what I want.

Thanks,

Julien

On 09/03/2020 13:59, Karl Wright wrote:
> Do you consider the confluence connector in the branch complete?
> If so I'll look at it as time permits later today.
>
> As far as your proposal is concerned, maintaining lists of 
> dependencies for all documents is quite expensive.  We do this for hop 
> counting and we basically tell people to only use it if they must, 
> because of the huge amount of database overhead involved.  We also 
> maintain "carrydown" data which is accessible during document 
> processing.  It is typically used for ingestion, but maybe you could 
> use that for a signal that child documents should delete themselves or 
> something.
>
> Major crawling model changes are a gigantic effort; there are always 
> many things to consider and many problems encountered that need to be 
> worked around.  If you are concerned simply with the load on your API 
> to handle deletions, I'd suggest using one of the existing mechanisms 
> for reducing that.  But I can see no straightforward way to 
> incrementally add dependency deletion to the current framework.
>
> Karl
>
>
> On Mon, Mar 9, 2020 at 5:53 AM Julien Massiera
> <julien.massi...@francelabs.com> wrote:
>
>> Hi Karl,
>>
>> Now that I have finished the Confluence connector, I am getting back
>> to the other one I was working on, and it would greatly help me to
>> have your thoughts on my proposal below.
>>
>> Thanks,
>> Julien
>>
>> On 02/03/2020 16:40, julien.massi...@francelabs.com wrote:
>>> Hi Karl,
>>>
>>> Thanks for your answer.
>>>
>>> Your explanations confirm what I expected about the way MCF currently
>>> implements its model. As you stated, this means that in order to use
>>> the _DELETE model properly, the seeding process has to provide the
>>> complete list of deleted documents.
>>>
>>> Yet wouldn't it be a useful improvement to update the
>>> activities.deleteDocument method (or create an additional delete
>>> method) so that it automatically, and optionally, removes the
>>> documents referenced by a document id?
>>>
>>> For instance, since the activities.addDocumentReference method already
>>> asks for the document identifier of the "parent" document, couldn't we
>>> maintain in Postgres a list of "child ids" and use it during the
>>> delete process to delete them?
>>>
>>> This is very useful in the use case I already described, but I am sure
>>> it would be useful for other types of connectors and/or future
>>> connectors. The benefits of such a modification increase with the
>>> number of crawled documents.
>>>
>>> Here is an illustration of the benefits of this MCF modification:
>>>
>>> With my current connector, if my first crawl ingests 1M documents and,
>>> on the delta crawl, only 1 document that has 2 children is deleted,
>>> the connector must rely on the processDocuments method to check the
>>> version of each of the 1M documents to figure out and delete the 3
>>> concerned ones (so at least 1M calls to the API of the targeted
>>> repository). With the suggested optional modification, the seeding
>>> process would use the delta API of the targeted repository and declare
>>> the parent document (only one API call), then the processDocuments
>>> method would be triggered only once to check the version of the
>>> document (one more API call), figure out that it no longer exists, and
>>> delete it along with its 2 children. That is 2 API calls vs 1M. Even
>>> if, on the framework side, we have one more request to make to
>>> Postgres, I think it is worth the processing time.
>>>
>>> What do you think?
>>>
>>> Julien
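
The call-count comparison in the message above can be reproduced with a toy
model. This is a sketch under stated assumptions: FakeRepository simulates
the remote API and counts calls, every name in it is hypothetical, and none
of it is ManifoldCF code. It assumes the delta API reports only the deleted
parent and that the two children are removed from a local index without any
API call.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the two strategies; all names are hypothetical. */
class DeltaCallCount {
    /** Simulated remote repository that counts every API call. */
    static class FakeRepository {
        private final Set<Integer> deleted = new HashSet<>();
        private final List<Integer> deletedParents = new ArrayList<>();
        long apiCalls = 0;

        void deleteParentWithChildren(int parent, int... children) {
            deleted.add(parent);
            for (int child : children) {
                deleted.add(child);
            }
            deletedParents.add(parent); // the delta feed reports parents only
        }

        /** One API call: does this document still exist? */
        boolean exists(int id) {
            apiCalls++;
            return !deleted.contains(id);
        }

        /** One API call: ids of parents deleted since the last crawl. */
        List<Integer> deltaDeletions() {
            apiCalls++;
            return new ArrayList<>(deletedParents);
        }
    }

    /** Document 0 and its children 1 and 2 have been deleted. */
    static FakeRepository sampleRepository() {
        FakeRepository repo = new FakeRepository();
        repo.deleteParentWithChildren(0, 1, 2);
        return repo;
    }

    /** Strategy 1: check every known document individually. */
    static long fullScanCalls(int totalDocs) {
        FakeRepository repo = sampleRepository();
        for (int id = 0; id < totalDocs; id++) {
            repo.exists(id);
        }
        return repo.apiCalls;
    }

    /** Strategy 2: one delta call, then one check per reported parent. */
    static long deltaCalls() {
        FakeRepository repo = sampleRepository();
        for (int parent : repo.deltaDeletions()) {
            repo.exists(parent); // children would come from a local index
        }
        return repo.apiCalls;
    }

    public static void main(String[] args) {
        System.out.println("full scan: " + fullScanCalls(1_000_000) + " API calls");
        System.out.println("delta: " + deltaCalls() + " API calls");
    }
}
```

With 1,000,000 documents the model reports 1,000,000 calls for the full scan
and 2 for the delta approach, matching the figures in the message. It says
nothing about the database cost on the framework side, which is the concern
raised later in the thread.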
>>>
>>> -----Original Message-----
>>> From: Karl Wright <daddy...@gmail.com>
>>> Sent: Saturday, February 29, 2020 15:51
>>> To: dev <dev@manifoldcf.apache.org>
>>> Subject: Re: Delta deletion
>>>
>>> Hi Julien,
>>>
>>> First, ALL models rely on individual existence checks for documents.
>>> That is, when your connector fetches a deleted document, the framework
>>> has to be told that the document is gone, or it will not be removed.
>>> There is no "discovery" process for deleted documents other than
>>> seeding (and only when the model includes _DELETE).
>>>
>>> The upshot of this is that IF your seeding method does not return all
>>> documents that have been removed THEN it cannot be a _DELETE model.
>>>
>>> I hope this helps.
>>>
>>> Karl
>>>
>>>
>>> On Sat, Feb 29, 2020 at 8:10 AM <julien.massi...@francelabs.com> wrote:
>>>
>>>> Hi dev community,
>>>>
>>>>
>>>>
>>>> I am trying to develop a connector for an API that exposes a
>>>> hierarchical tree of documents: each document can have child
>>>> documents.
>>>>
>>>> During the initial crawl, the child documents are referenced in the
>>>> MCF connector through the method
>>>> activities.addDocumentReference(childDocumentIdentifier,
>>>> parentDocumentIdentifier, parentDataNames, parentDataValues)
>>>>
>>>> The API is able to provide delta modifications/deletions from a 
>>>> provided date but, when a document that has children is deleted, 
>>>> the API only returns the id of the document, not its children. On 
>>>> the MCF connector side, I thought that, as I have referenced the 
>>>> children, by deleting the parent document all its children would be 
>>>> deleted with it, but it appears that this is not the case.
>>>>
>>>> So my question is: did I miss something? Is there another way to
>>>> perform delta deletions? Unfortunately, if I don't find a way to
>>>> solve this issue, I will not be able to take advantage of the delta
>>>> feature, and thus I will have to use the "add_modify" connector type
>>>> and test every id on a delta crawl to figure out which ids are
>>>> missing. This would be a huge loss of performance.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Julien Massiera
>>>>
>>>>
>> --
>> Julien MASSIERA
>> Head of Product Development
>> France Labs – The Search Experts
>> Datafari – Winner of the Big Data 2018 trophy at the Digital Innovation
>> Makers Summit www.francelabs.com
>>
>>
--
Julien MASSIERA
Head of Product Development
France Labs – The Search Experts
Datafari – Winner of the Big Data 2018 trophy at the Digital Innovation
Makers Summit www.francelabs.com
