I had a quick look at Jira. I think there is already a ticket that
covers the requirement of using a sitemap.xml file referenced by
robots.txt:
https://issues.apache.org/jira/browse/CONNECTORS-1657
I'll update this ticket with information from the sitemap protocol page.
If you wish to add a feature request, please create a CONNECTORS ticket
that describes the functionality you think the connector should have.
Karl
On Wed, Jul 7, 2021 at 9:29 AM h0444xk8 wrote:
Hi,
yes, that seems to be the reason. In:
https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java
there is the following code sequence:
else if
The robots parsing does not recognize the "Sitemap:" line, which was likely
not in the spec for robots when this connector was written.
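
To illustrate what recognizing that line could involve, here is a minimal
standalone sketch. It is not the code in Robots.java; the class name
SitemapLineSketch and the method extractSitemaps are hypothetical, and
matching the field name case-insensitively is an assumption. It relies only
on the sitemap protocol's "Sitemap: <url>" form, which is independent of any
User-Agent block.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of ManifoldCF.
class SitemapLineSketch {

    // Returns the URLs of all "Sitemap:" lines found in the robots.txt text.
    static List<String> extractSitemaps(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            String line = rawLine.trim();
            // Skip blank lines and full-line comments.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            // Split at the first colon only, so the "https:" in the URL survives.
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            // Assumption: the field name is matched case-insensitively.
            if (field.equals("sitemap") && !value.isEmpty()) {
                sitemaps.add(value);
            }
        }
        return sitemaps;
    }

    public static void main(String[] args) {
        // The example.com URLs are placeholders, not taken from the thread.
        String robots = "User-Agent: *\n"
                + "Disallow:\n"
                + "Sitemap: https://www.example.com/sitemap1.xml.gz\n"
                + "Sitemap: https://www.example.com/sitemap2.xml.gz\n";
        for (String url : extractSitemaps(robots)) {
            System.out.println(url);
        }
    }
}
```

This only collects the URLs. Fetching them, decompressing .gz files and
parsing the sitemap XML would be separate steps for the connector.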
Karl
On Wed, Jul 7, 2021 at 3:31 AM h0444xk8 wrote:
Hi,
I have a general question. Does the Web connector support sitemap files
referenced by robots.txt? In my use case, robots.txt is stored in
the root of the website and references two compressed sitemaps.
Example of robots.txt
User-Agent: *
Disallow: