Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread h0444xk8
I had a quick look at Jira. I think there is already a ticket that covers the requirement of using a sitemap.xml file referenced by robots.txt: https://issues.apache.org/jira/browse/CONNECTORS-1657 I'll update this ticket with information from the sitemap protocol page: https://www.sitemaps.org

Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread Karl Wright
If you wish to add a feature request, please create a CONNECTORS ticket that describes the functionality you think the connector should have.
Karl
On Wed, Jul 7, 2021 at 9:29 AM h0444xk8 wrote:
> Hi,
>
> yes, that seems to be the reason. In:
>
> https://github.com/apache/manifoldcf/blob/0307

Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread h0444xk8
Hi,
yes, that seems to be the reason. In:
https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java
there is the following code sequence:
else if (lowercaseLine.startsWith("

Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread Karl Wright
The robots parsing does not recognize the "sitemaps" line, which was likely not in the spec for robots when this connector was written.
Karl
On Wed, Jul 7, 2021 at 3:31 AM h0444xk8 wrote:
> Hi,
>
> I have a general question. Is the Web connector supporting sitemap files
> referenced by the rob
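
For readers of this thread, here is a minimal sketch of what the requested behaviour could look like. It is not taken from ManifoldCF and does not reflect how Robots.java is structured: the class SitemapSketch and the methods extractSitemapUrls and decodeSitemapBody are hypothetical names chosen for illustration. It assumes only what the sitemap protocol page describes, namely that robots.txt may carry lines of the form "Sitemap: <url>" and that a sitemap file may be gzip-compressed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of ManifoldCF.
public class SitemapSketch {

  /** Collects the URL from every "Sitemap:" line of a robots.txt body. */
  public static List<String> extractSitemapUrls(String robotsTxt) {
    List<String> urls = new ArrayList<>();
    for (String rawLine : robotsTxt.split("\\r?\\n")) {
      String line = rawLine.trim();
      // The directive name is matched case-insensitively.
      if (line.toLowerCase(Locale.ROOT).startsWith("sitemap:")) {
        String url = line.substring("sitemap:".length()).trim();
        if (!url.isEmpty()) {
          urls.add(url);
        }
      }
    }
    return urls;
  }

  /** Returns the sitemap text, gunzipping it first if it starts with the gzip magic bytes. */
  public static String decodeSitemapBody(byte[] body) throws IOException {
    boolean gzipped = body.length >= 2
        && (body[0] & 0xff) == 0x1f
        && (body[1] & 0xff) == 0x8b;
    if (!gzipped) {
      return new String(body, StandardCharsets.UTF_8);
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
      byte[] buffer = new byte[4096];
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
    }
    return new String(out.toString("UTF-8"));
  }

  public static void main(String[] args) throws IOException {
    // Example robots.txt content; example.com is a placeholder host.
    String robots = "User-agent: *\nDisallow: /tmp/\nSitemap: https://example.com/sitemap.xml.gz\n";
    System.out.println(extractSitemapUrls(robots));

    // Simulate a downloaded sitemap.xml.gz by compressing a tiny document in memory.
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
      gz.write("<urlset/>".getBytes(StandardCharsets.UTF_8));
    }
    System.out.println(decodeSitemapBody(compressed.toByteArray()));
  }
}
```

Checking the magic bytes rather than the ".gz" suffix is a design choice of this sketch, so that a compressed file served under a plain name is still decoded. Fetching the URLs and feeding them into the crawl queue is left out, since that depends on connector internals not shown in this thread.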