Contribution to ManifoldCF webcrawler

2023-09-25 Thread Emeric Bernet-Rollande
Hi Karl and all!
 
I’ve been working on the MCF webcrawler component for our Datafari project, and 
I made some developments that might interest the MCF community.
 
Currently, if a website redirects the user with a 301 or 302 status code and
the "limit to seed" option is checked, the website that the redirection points
to won’t be indexed. We added an option, "Force the inclusion of redirections",
which overrides the previous checkbox when the crawl encounters a redirection.



Would you be interested in getting the patch to integrate it into ManifoldCF? 
The corresponding documentation can be found here: 
https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/1886879745/Web+Connectors
 
Regards,

Emeric Bernet-Rollande

France Labs – Your knowledge, now
Datafari Enterprise Search – Discover our version 5
www.datafari.com



Re: Contribution to ManifoldCF webcrawler

2023-09-25 Thread Furkan KAMACI
Hi Emeric,

First of all, thank you for your effort and suggestion. Do you have a Pull
Request for that improvement?

Kind regards,
Furkan Kamaci



RE : Contribution to ManifoldCF webcrawler

2023-09-25 Thread Emeric Bernet-Rollande
Hi,

I opened a pull request here:
https://github.com/apache/manifoldcf/pull/149

Regards,

Emeric Bernet-Rollande





[GitHub] [manifoldcf] Soggard opened a new pull request, #149: Webcrawler connector - Add "Force inclusion of redirections" option

2023-09-25 Thread via GitHub


Soggard opened a new pull request, #149:
URL: https://github.com/apache/manifoldcf/pull/149

   The "Force the inclusion of redirections" option lets you include hosts
that the original seeds redirect to. You might want to use it if the site you
are crawling is subject to redirections. Note that it is not needed when the
previous option, "Include only hosts", is not checked. The possible behaviors
are:
   
   - If the user checks "Include only hosts" but not "Force the inclusion",
redirected documents are filtered out when their new URL doesn't match the
seed.
   - If the user checks both "Include only hosts" and "Force the inclusion",
a URL that is not in the same domain as the seed is dropped, EXCEPT if it
originates from a 301 or 302 redirection in the document queue.
   - If the user does NOT check "Include only hosts" but checks "Force the
inclusion", the job crawls any URL it finds, including those that originate
from a 301 or 302 redirection.
   - If the user checks neither option, the behavior is the same as in the
previous case.
   
   If the admin checks the second option AND the first option is also checked,
the job examines every host added to the Set. If a host is subject to
redirection, the destination URL is added to the Set as well.
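
   The four cases above can be sketched as a small decision function. This is
only an illustration of the described behavior, not the code from the pull
request: the function names, the parameter names, and the choice to store
hosts (rather than full URLs) in the Set are all assumptions made for the
example.

```python
from urllib.parse import urlsplit

# The two redirect status codes named in the description.
REDIRECT_STATUSES = {301, 302}


def host_of(url):
    """Return the lowercased host of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def is_url_accepted(url, allowed_hosts, include_only_hosts):
    """Host filter: when 'Include only hosts' is off, every URL passes."""
    if not include_only_hosts:
        return True
    return host_of(url) in allowed_hosts


def record_redirect(allowed_hosts, status, target_url,
                    include_only_hosts, force_inclusion):
    """Add the redirect target's host to the set, but only when both
    options are checked and the status is 301 or 302 (hypothetical helper)."""
    if include_only_hosts and force_inclusion and status in REDIRECT_STATUSES:
        allowed_hosts.add(host_of(target_url))


# Hypothetical crawl: the seed host redirects to a different host.
seed = "http://example.com/"
target = "https://www.example.org/home"

# Case 1: only "Include only hosts" is checked, so the target is filtered.
strict_hosts = {host_of(seed)}
record_redirect(strict_hosts, 301, target, True, False)
print(is_url_accepted(target, strict_hosts, True))

# Case 2: both options are checked, so the target host is accepted.
forced_hosts = {host_of(seed)}
record_redirect(forced_hosts, 301, target, True, True)
print(is_url_accepted(target, forced_hosts, True))
```

   In this sketch, cases 3 and 4 fall out of `is_url_accepted` alone: with
"Include only hosts" unchecked, the second option changes nothing.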




Re: RE : Contribution to ManifoldCF webcrawler

2023-09-25 Thread Karl Wright
Thanks.
I will have a look at the first opportunity.
Karl

