Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread h0444xk8
I had a quick look at Jira. I think there is already a ticket that covers the requirement of using a sitemap.xml file referenced by robots.txt: https://issues.apache.org/jira/browse/CONNECTORS-1657
I'll update this ticket with information from the sitemap protocol page.

Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread Karl Wright
If you wish to add a feature request, please create a CONNECTORS ticket that describes the functionality you think the connector should have.
Karl
On Wed, Jul 7, 2021 at 9:29 AM h0444xk8 wrote:
> Hi,
>
> yes, that seems to be the reason. In:
>
>
>

Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread h0444xk8
Hi,
yes, that seems to be the reason. In:
https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java
there is the following code sequence:
else if

Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread Karl Wright
The robots parsing does not recognize the "sitemaps" line, which was likely not in the spec for robots when this connector was written.
Karl
On Wed, Jul 7, 2021 at 3:31 AM h0444xk8 wrote:
> Hi,
>
> I have a general question. Is the Web connector supporting sitemap files
> referenced by the
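For illustration only, recognizing such a line could look roughly like the sketch below. It assumes the directive is written as `Sitemap: <url>` with a case-insensitive field name, as the sitemap protocol describes. The class and method names are hypothetical and are not taken from ManifoldCF's Robots.java.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of ManifoldCF's Robots.java.
final class SitemapLines {

    // Returns the URLs of all "Sitemap:" lines found in a robots.txt body.
    static List<String> extractSitemaps(String robotsTxt) {
        List<String> urls = new ArrayList<>();
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Drop trailing comments. Simplification: this would also cut
            // a URL that contains a '#' fragment.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();

            // Split at the first colon only, so "https://" in the value survives.
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim();
            if (field.equalsIgnoreCase("sitemap")) {
                String value = line.substring(colon + 1).trim();
                if (!value.isEmpty()) {
                    urls.add(value);
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Example input; the URLs are placeholders.
        String robots = "User-Agent: *\n"
                + "Disallow: /private/\n"
                + "Sitemap: https://www.example.com/sitemap1.xml.gz\n"
                + "sitemap: https://www.example.com/sitemap2.xml.gz\n";
        System.out.println(extractSitemaps(robots));
    }
}
```

The sketch treats sitemap lines as independent of any User-Agent group, so it collects them from the whole file.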

Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread h0444xk8
Hi,
I have a general question. Does the Web connector support sitemap files referenced by robots.txt? In my use case, robots.txt is stored in the root of the website and references two compressed sitemaps.
Example of robots.txt:
User-Agent: *
Disallow:
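As a rough sketch of what handling a compressed sitemap involves, the snippet below decompresses gzip bytes into the sitemap XML text using the JDK's `java.util.zip` classes. It assumes the `.xml.gz` file has already been downloaded into a byte array. The class and method names are hypothetical, and the `gzip` helper exists only to build sample input. This is not how the Web connector currently behaves.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; not part of ManifoldCF.
final class GzipSitemap {

    // Decompresses the bytes of a sitemap.xml.gz file into its XML text.
    // Assumes the sitemap is UTF-8 encoded.
    static String gunzip(byte[] compressed) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compresses text with gzip; used here only to create sample input.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Placeholder sitemap content.
        String xml = "<urlset><url><loc>https://www.example.com/</loc></url></urlset>";
        System.out.println(gunzip(gzip(xml)));
    }
}
```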