Contribution to ManifoldCF webcrawler
Hi Karl and all,

I’ve been working on the MCF web crawler component for our Datafari project, and I have made some changes that might interest the MCF community.

Currently, if a website redirects the user with a 301 or 302 status code and the « limit to seed » option is checked, the website targeted by the redirection won’t be indexed. We added an option, « Force the inclusion of redirections », which overrides that checkbox when the crawl encounters a redirection.

Would you be interested in a patch to integrate it into ManifoldCF? The corresponding documentation can be found here: https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/1886879745/Web+Connectors

Regards,
Emeric Bernet-Rollande
France Labs – Your knowledge, now
Datafari Enterprise Search – Discover our version 5
www.datafari.com
Re: Contribution to ManifoldCF webcrawler
Hi Emeric,

First of all, thank you for your effort and suggestion. Do you have a Pull Request for that improvement?

Kind regards,
Furkan Kamaci

On Mon, Sep 25, 2023 at 10:23 AM Emeric Bernet-Rollande < emeric.ber...@francelabs.com> wrote:
> Would you be interested in getting the patch to integrate it into
> ManifoldCF?
RE : Contribution to ManifoldCF webcrawler
Hi,

I opened a Pull Request here: https://github.com/apache/manifoldcf/pull/149

Regards,
Emeric Bernet-Rollande

From: Furkan KAMACI
Sent: Monday, September 25, 2023 09:28
To: dev@manifoldcf.apache.org
Cc: olivier.tav...@francelabs.com; France Labs
Subject: Re: Contribution to ManifoldCF webcrawler

Hi Emeric,

First of all, thank you for your effort and suggestion. Do you have a Pull Request for that improvement?

Kind regards,
Furkan Kamaci
[GitHub] [manifoldcf] Soggard opened a new pull request, #149: Webcrawler connector - Add "Force inclusion of redirections" option
Soggard opened a new pull request, #149:
URL: https://github.com/apache/manifoldcf/pull/149

The “Force the inclusion of redirections” option allows you to include hosts that the original seeds redirect to. You might want to use this option if the site you are crawling is subject to redirections. Note that it is not needed if the previous option (“Include only hosts”) is not checked.

Here are the possible behaviors:

- If the user checks “Include only hosts” but not “Force the inclusion”, redirected documents are filtered out when their new URL doesn’t match the seed.
- If the user checks both “Include only hosts” and “Force the inclusion”, a URL that is not in the same domain is dropped, except when it originates from a 301 or 302 redirection in the document queue.
- If the user does not check “Include only hosts” but checks “Force the inclusion”, the job crawls any URL found, including URLs that originate from a 301 or 302 redirection.
- If the user checks neither option, the behavior is the same as in the previous case.

If the admin checks the second option and the first option is also checked, the job will check every host added to the Set. If a host is subject to redirection, the destination URL is added to the Set.
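The four cases listed in the pull request description can be read as a single predicate over the two checkboxes. The following is a minimal sketch of that reading, not the code from the pull request: the class name `RedirectInclusionSketch`, its methods, and the example hosts are all hypothetical, and it tracks destination hosts rather than full URLs as a simplification.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical illustration of the option semantics described in the PR text.
 * None of these names come from ManifoldCF itself.
 */
final class RedirectInclusionSketch {
    private final boolean includeOnlySeedHosts;   // the "Include only hosts" checkbox
    private final boolean forceRedirectInclusion; // the "Force the inclusion" checkbox
    private final Set<String> allowedHosts = new HashSet<>();

    public RedirectInclusionSketch(boolean includeOnlySeedHosts,
                                   boolean forceRedirectInclusion,
                                   Set<String> seedHosts) {
        this.includeOnlySeedHosts = includeOnlySeedHosts;
        this.forceRedirectInclusion = forceRedirectInclusion;
        for (String host : seedHosts) {
            allowedHosts.add(normalize(host));
        }
    }

    /** Only 301 and 302 are mentioned in the PR text, so only those count here. */
    public static boolean isForcedRedirectStatus(int httpStatus) {
        return httpStatus == 301 || httpStatus == 302;
    }

    /**
     * Decides whether a URL on the given host should be crawled.
     *
     * @param host        host name of the candidate URL
     * @param viaRedirect true if the URL was produced by a 301 or 302 redirection
     */
    public boolean shouldCrawl(String host, boolean viaRedirect) {
        if (!includeOnlySeedHosts) {
            // Cases 3 and 4: no host restriction, so every URL is accepted.
            return true;
        }
        String normalized = normalize(host);
        if (allowedHosts.contains(normalized)) {
            return true;
        }
        if (forceRedirectInclusion && viaRedirect) {
            // Case 2: remember the redirect destination so later URLs on it pass.
            allowedHosts.add(normalized);
            return true;
        }
        // Case 1, or case 2 without a redirect: an off-seed host is dropped.
        return false;
    }

    private static String normalize(String host) {
        return host.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Set<String> seeds = Collections.singleton("example.com");
        RedirectInclusionSketch filter = new RedirectInclusionSketch(true, true, seeds);
        // The same off-seed host, first reached by a plain link, then by a 301.
        System.out.println(filter.shouldCrawl("www.example.org", isForcedRedirectStatus(200)));
        System.out.println(filter.shouldCrawl("www.example.org", isForcedRedirectStatus(301)));
    }
}
```

In this sketch, a redirect destination stays allowed for the rest of the job once it has been added, which is one plausible interpretation of "we add the destination URL in the Set"; the actual connector may scope this differently.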
Re: RE : Contribution to ManifoldCF webcrawler
Thanks. I will have a look at first opportunity.

Karl

On Mon, Sep 25, 2023 at 7:00 AM Emeric Bernet-Rollande < emeric.ber...@francelabs.com> wrote:
> I opened a Pull Request, right here !
> https://github.com/apache/manifoldcf/pull/149