If you wish to add a feature request, please create a CONNECTORS ticket
that describes the functionality you think the connector should have.

Karl


On Wed, Jul 7, 2021 at 9:29 AM h0444xk8 <h0444...@posteo.de> wrote:

> Hi,
>
> yes, that seems to be the reason. In:
>
>
> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java
>
> there is the following code sequence:
>
> else if (lowercaseLine.startsWith("sitemap:"))
>            {
>              // We don't complain about this, but right now we don't
>              // listen to it either.
>            }
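>
> For illustration, here is a minimal standalone sketch (hypothetical
> code, not from ManifoldCF; the class and method names are invented) of
> how such a branch could collect the sitemap URLs instead of discarding
> them:
>
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Locale;
>
> public class SitemapLineParser
> {
>   // Extract sitemap URLs from robots.txt lines. The directive name is
>   // matched case-insensitively; the URL itself is kept verbatim.
>   public static List<String> extractSitemaps(List<String> robotsLines)
>   {
>     List<String> sitemaps = new ArrayList<>();
>     for (String line : robotsLines)
>     {
>       String lowercaseLine = line.trim().toLowerCase(Locale.ROOT);
>       if (lowercaseLine.startsWith("sitemap:"))
>       {
>         String url = line.trim().substring("sitemap:".length()).trim();
>         if (!url.isEmpty())
>           sitemaps.add(url);
>       }
>     }
>     return sitemaps;
>   }
> }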
>
> But if I have a look at:
>
>
> https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
>
> a sitemap containing a urlset element seems to be handled:
>
> else if (localName.equals("urlset") || localName.equals("sitemapindex"))
>        {
>          // Sitemap detected
>          outerTagCount++;
>          return new UrlsetContextClass(theStream,namespace,localName,qName,
>            atts,documentURI,handler);
>        }
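>
> For comparison, here is a standalone sketch (not the connector's actual
> context-class machinery) of the same idea with plain SAX: detect each
> <loc> element inside a urlset or sitemapindex and collect its text:
>
> import java.io.InputStream;
> import java.util.ArrayList;
> import java.util.List;
> import javax.xml.parsers.SAXParserFactory;
> import org.xml.sax.Attributes;
> import org.xml.sax.helpers.DefaultHandler;
>
> public class SitemapLocExtractor extends DefaultHandler
> {
>   private final List<String> locs = new ArrayList<>();
>   private final StringBuilder text = new StringBuilder();
>   private boolean inLoc = false;
>
>   @Override
>   public void startElement(String uri, String localName, String qName,
>     Attributes atts)
>   {
>     // <loc> carries a page URL (urlset) or a nested sitemap URL
>     // (sitemapindex)
>     if ("loc".equals(localName))
>     {
>       inLoc = true;
>       text.setLength(0);
>     }
>   }
>
>   @Override
>   public void characters(char[] ch, int start, int length)
>   {
>     if (inLoc)
>       text.append(ch, start, length);
>   }
>
>   @Override
>   public void endElement(String uri, String localName, String qName)
>   {
>     if (inLoc && "loc".equals(localName))
>     {
>       locs.add(text.toString().trim());
>       inLoc = false;
>     }
>   }
>
>   public static List<String> extract(InputStream in) throws Exception
>   {
>     SAXParserFactory factory = SAXParserFactory.newInstance();
>     factory.setNamespaceAware(true);
>     SitemapLocExtractor handler = new SitemapLocExtractor();
>     factory.newSAXParser().parse(in, handler);
>     return handler.locs;
>   }
> }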
>
> So, my question is: is there another way to handle sitemaps inside the
> Web Crawler?
>
> Cheers, Sebastian
>
>
> On 07.07.2021 at 12:23, Karl Wright wrote:
>
> > The robots.txt parsing does not recognize the "Sitemap:" line, which was
> > likely not in the robots spec when this connector was written.
> >
> > Karl
> >
> > On Wed, Jul 7, 2021 at 3:31 AM h0444xk8 <h0444...@posteo.de> wrote:
> >
> >> Hi,
> >>
> >> I have a general question. Does the Web connector support sitemap files
> >> referenced by robots.txt? In my use case the robots.txt is stored in
> >> the root of the website and references two compressed sitemaps.
> >>
> >> Example of robots.txt
> >> ------------------------
> >> User-Agent: *
> >> Disallow:
> >> Sitemap: https://www.example.de/sitemap/de-sitemap.xml.gz
> >> Sitemap: https://www.example.de/sitemap/en-sitemap.xml.gz
> >>
> >> When the crawl starts, there is an error log entry in "Simple History"
> >> as follows:
> >>
> >> Unknown robots.txt line: 'Sitemap:
> >> https://www.example.de/sitemap/en-sitemap.xml.gz'
> >>
> >> Is there a general problem with sitemaps at all, with sitemaps
> >> referenced in robots.txt, or with compressed sitemaps?
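> >>
> >> (As far as I understand, a compressed sitemap could in principle be
> >> handled by decompressing the stream before XML parsing; a hypothetical
> >> sketch, not connector code:)
> >>
> >> import java.io.InputStream;
> >> import java.net.URL;
> >> import java.util.zip.GZIPInputStream;
> >>
> >> public class GzipSitemapFetch
> >> {
> >>   // Open a sitemap URL, transparently decompressing .gz responses.
> >>   public static InputStream open(String sitemapUrl) throws Exception
> >>   {
> >>     InputStream raw = new URL(sitemapUrl).openStream();
> >>     return sitemapUrl.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
> >>   }
> >> }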
> >>
> >> Best regards
> >>
> >> Sebastian
>
