Re: How documents are deleted
Hi Karl,

Thanks for your response. I found what I need in the documentation.

Julien

On 24/10/2018 16:06, Karl Wright wrote:
> Hi Julien,
>
> This is a complex question and the framework behaves differently
> depending on the connector model. Please read:
> https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs
>
> Karl

--
Julien MASSIERA
Product Development Director
France Labs – The Search Experts
Meet us at the Enterprise Search & Discovery Summit in Washington DC
www.francelabs.com
Re: How documents are deleted
Hi Julien,

This is a complex question and the framework behaves differently depending on the connector model. Please read:
https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs

Karl

On Wed, Oct 24, 2018 at 5:26 AM Julien Massiera <julien.massi...@francelabs.com> wrote:
> [snip]
How documents are deleted
Hi Karl,

I am trying to understand how ManifoldCF behaves during a re-crawl, and especially how missing documents are deleted and by which process.

I am focusing on two repository connectors: the JCIFS one and the JDBC one. Here is what I understand so far:

In the JCIFS connector, the addSeedDocuments method lists all the files found for each configured path. So it seems clear that any previously crawled file that is not listed by this method during a re-crawl should be deleted.

In the JDBC connector, the addSeedDocuments method lists only the new or modified documents during a re-crawl (provided, of course, that the id query correctly uses the starttime and endtime variables). So there is a difference between the two connectors: to delete missing documents, the previously crawled ones need to be 'checked' with the version query to detect the documents that must be removed.

I am currently unable to tell what ManifoldCF actually does to handle documents that must be deleted, or whether any of the assumptions I stated above are correct and/or used. I would also like to know which part of the code performs the delete process.

Thanks for your help.

--
Julien MASSIERA
Product Development Director
France Labs – The Search Experts
Meet us at the Enterprise Search & Discovery Summit in Washington DC
www.francelabs.com
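
For readers following the thread, the two assumptions in the question above can be put side by side in a small Java sketch. This is not ManifoldCF code, and it does not claim to show what the framework actually does (per Karl's reply, that depends on the connector model). The class and method names (DeletionSketch, missingAfterFullListing, missingAfterVersionCheck) are hypothetical, and documents are reduced to plain string identifiers.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in ManifoldCF.
public class DeletionSketch {

    /**
     * Assumption 1 (full enumeration, as described for the JCIFS connector):
     * seeding lists every document that currently exists, so anything that
     * was crawled before but is not listed now is a deletion candidate.
     */
    public static Set<String> missingAfterFullListing(Set<String> previouslyCrawled,
                                                      Set<String> listedNow) {
        Set<String> toDelete = new TreeSet<>(previouslyCrawled);
        toDelete.removeAll(listedNow);
        return toDelete;
    }

    /**
     * Assumption 2 (incremental seeding, as described for the JDBC connector):
     * seeding lists only new or modified documents, so absence from that list
     * proves nothing. Every previously crawled document is looked up again,
     * and one with no current version is a deletion candidate.
     */
    public static Set<String> missingAfterVersionCheck(Set<String> previouslyCrawled,
                                                       Map<String, String> currentVersions) {
        Set<String> toDelete = new TreeSet<>();
        for (String id : previouslyCrawled) {
            if (!currentVersions.containsKey(id)) {
                toDelete.add(id);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        Set<String> crawled = new HashSet<>(Arrays.asList("a", "b", "c"));

        // Full listing: "b" is gone from the repository, "d" is new.
        Set<String> listed = new HashSet<>(Arrays.asList("a", "c", "d"));
        System.out.println(missingAfterFullListing(crawled, listed)); // prints [b]

        // Version check: only "a" and "c" still return a version.
        Map<String, String> versions = new HashMap<>();
        versions.put("a", "v2");
        versions.put("c", "v1");
        System.out.println(missingAfterVersionCheck(crawled, versions)); // prints [b]
    }
}
```

Both methods reach the same answer for the same repository state. The difference is where the evidence comes from: a complete listing in the first case, a per-document lookup in the second.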