If you wish to add a feature request, please create a CONNECTORS ticket that describes the functionality you think the connector should have.

Karl

On Wed, Jul 7, 2021 at 9:29 AM h0444xk8 <h0444...@posteo.de> wrote:

> Hi,
>
> yes, that seems to be the reason. In:
>
> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java
>
> there is the following code sequence:
>
>     else if (lowercaseLine.startsWith("sitemap:"))
>     {
>       // We don't complain about this, but right now we don't listen to it either.
>     }
>
> But if I have a look at:
>
> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
>
> a sitemap containing a urlset seems to be handled:
>
>     else if (localName.equals("urlset") || localName.equals("sitemapindex"))
>     {
>       // Sitemap detected
>       outerTagCount++;
>       return new UrlsetContextClass(theStream,namespace,localName,qName,atts,documentURI,handler);
>     }
>
> So, my question is: is there another way to handle sitemaps inside the
> Web Crawler?
>
> Cheers
> Sebastian
>
> On 07.07.2021 at 12:23, Karl Wright wrote:
>
> > The robots parsing does not recognize the "sitemaps" line, which was
> > likely not in the spec for robots when this connector was written.
> >
> > Karl
> >
> > On Wed, Jul 7, 2021 at 3:31 AM h0444xk8 <h0444...@posteo.de> wrote:
> >
> >> Hi,
> >>
> >> I have a general question. Does the Web connector support sitemap
> >> files referenced by the robots.txt? In my use case the robots.txt is
> >> stored in the root of the website and references two compressed
> >> sitemaps.
> >>
> >> Example of robots.txt
> >> ------------------------
> >> User-Agent: *
> >> Disallow:
> >> Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz
> >> Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz
> >>
> >> When I start crawling, there is an error log entry in "Simple
> >> History" as follows:
> >>
> >> Unknown robots.txt line: 'Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz'
> >>
> >> Is there a general problem with sitemaps at all, or with sitemaps
> >> referenced in robots.txt, or with compressed sitemaps?
> >>
> >> Best regards
> >>
> >> Sebastian
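
For anyone drafting the CONNECTORS ticket, here is a minimal sketch of what "listening to" the Sitemap: line could look like. It is not ManifoldCF code: the class name RobotsSitemapSketch and the method extractSitemapUrls are hypothetical, and the sketch only collects the URLs from a robots.txt body. Fetching, decompressing the .gz files, and queuing the sitemap documents in the connector are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of ManifoldCF.
final class RobotsSitemapSketch
{
  private static final String PREFIX = "sitemap:";

  /** Returns the URLs named by "Sitemap:" lines, in order of appearance. */
  static List<String> extractSitemapUrls(String robotsTxt)
  {
    List<String> urls = new ArrayList<>();
    for (String rawLine : robotsTxt.split("\\r?\\n"))
    {
      // Drop a trailing "# comment" and surrounding whitespace.
      // Assumption: a '#' inside the URL is treated as a comment start.
      int hash = rawLine.indexOf('#');
      String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();

      // Case-insensitive prefix match, so "Sitemap:" and "sitemap:" both count.
      if (line.regionMatches(true, 0, PREFIX, 0, PREFIX.length()))
      {
        String url = line.substring(PREFIX.length()).trim();
        if (!url.isEmpty())
          urls.add(url);
      }
    }
    return urls;
  }

  public static void main(String[] args)
  {
    String robots = "User-Agent: *\n"
      + "Disallow:\n"
      + "Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz\n"
      + "Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz\n";
    for (String url : extractSitemapUrls(robots))
      System.out.println(url);
  }
}
```

The collected URLs would still have to be handed to the crawler as ordinary documents. Whether the existing urlset/sitemapindex handling then copes with gzip-compressed sitemaps is a separate question that this sketch does not answer.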