[ https://issues.apache.org/jira/browse/CONNECTORS-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482114#comment-17482114 ]
Karl Wright commented on CONNECTORS-1695:
-----------------------------------------

The interestingMimeTypes are mime types that may be HTML or XHTML, or documents where links cannot be extracted but where the content can be indexed. What actually happens to a given document depends on what is actually in it rather than what the mime type says. The mimetypes are a crude filter only.

The code that we'd want to modify would be the extractLinks code:

{code}
// Now, extract links.
// We'll call the "link extractor" series, so we can plug more stuff in over time.
boolean indexDocument = extractLinks(documentIdentifier,activities,filter);
{code}

The code for this method is:

{code}
/** Code to extract links from an already-fetched document. */
protected boolean extractLinks(String documentIdentifier, IProcessActivity activities, DocumentURLFilter filter)
  throws ManifoldCFException, ServiceInterruption
{
  ProcessActivityRedirectionHandler redirectHandler = new ProcessActivityRedirectionHandler(documentIdentifier,activities,filter);
  handleRedirects(documentIdentifier,redirectHandler);
  if (Logging.connectors.isDebugEnabled() && redirectHandler.shouldIndex() == false)
    Logging.connectors.debug("Web: Not indexing document '"+documentIdentifier+"' because of redirection");
  // For html, we don't want any actions, because we don't do form submission.
  ProcessActivityHTMLHandler htmlHandler = new ProcessActivityHTMLHandler(documentIdentifier,activities,filter,metaRobotsTagsUsage);
  handleHTML(documentIdentifier,htmlHandler);
  if (Logging.connectors.isDebugEnabled() && htmlHandler.shouldIndex() == false)
    Logging.connectors.debug("Web: Not indexing document '"+documentIdentifier+"' because of HTML robots or content tags prohibiting indexing");
  ProcessActivityXMLHandler xmlHandler = new ProcessActivityXMLHandler(documentIdentifier,activities,filter);
  handleXML(documentIdentifier,xmlHandler);
  if (Logging.connectors.isDebugEnabled() && xmlHandler.shouldIndex() == false)
    Logging.connectors.debug("Web: Not indexing document '"+documentIdentifier+"' because of XML robots or content tags prohibiting indexing");
  // May add more later for other extraction tasks.
  return htmlHandler.shouldIndex() && redirectHandler.shouldIndex() && xmlHandler.shouldIndex();
}
{code}

Note that there are three different parsing attempts made: HTML, XML (which is I believe RSS feeds only at this point) and redirection pages. You could add a fourth.

Most of these invoke the fuzzyml parser, which is a bottom-up parser with overrides for specific interesting tags. Even though the sitemap xml is supposedly well formed, you wouldn't want to bet on it, and the fuzzyml parser would be a reasonable technology to do parsing of this kind since it is quite resilient against syntax errors of all kinds. So the trick would be to identify the tag structure of a sitemap document and extend the overrides present for the parser the web connector is using to understand that xml syntax IN ADDITION TO the xhtml it already understands.
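
To make the sitemap tag structure concrete, here is a minimal standalone sketch of what a fourth extraction pass would need to pull out: the text of every <loc> element, whether it sits under <sitemapindex>/<sitemap> or <urlset>/<url>. This is an illustration only. It does not use the fuzzyml parser or any ManifoldCF API; the class name SitemapLocSketch and the method extractLocs are hypothetical, and a tolerant regular expression stands in for the resilient parsing described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, NOT part of ManifoldCF: collects <loc> values from sitemap text. */
final class SitemapLocSketch {
  // Tolerant match: optional namespace prefix, any letter case, optional attributes,
  // then everything up to the next '<'. No well-formedness is required.
  private static final Pattern LOC =
    Pattern.compile("<(?:\\w+:)?loc\\b[^>]*>([^<]*)", Pattern.CASE_INSENSITIVE);

  /** Returns the trimmed, non-empty contents of every <loc> element, in document order. */
  static List<String> extractLocs(String sitemapText) {
    List<String> urls = new ArrayList<>();
    Matcher m = LOC.matcher(sitemapText);
    while (m.find()) {
      // Assumption: only the &amp; entity is decoded; other entities are left as-is.
      String url = m.group(1).trim().replace("&amp;", "&");
      if (!url.isEmpty())
        urls.add(url);
    }
    return urls;
  }

  public static void main(String[] args) {
    String sample =
      "<urlset><url><loc>https://example.com/doc</loc>"
      + "<lastmod>2018-10-31T11:25:27Z</lastmod></url></urlset>";
    // Each extracted URL would be handed to the connector as a discovered link.
    for (String url : extractLocs(sample))
      System.out.println(url);
  }
}
```

In the connector itself, each extracted URL would presumably be added as a document reference instead of being printed, and a handler that recognizes a sitemap would report that the document should not be indexed, in the same way the existing handlers expose shouldIndex().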
> Sitemap xml not detected in version 2.17 webconnector
> -----------------------------------------------------
>
> Key: CONNECTORS-1695
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1695
> Project: ManifoldCF
> Issue Type: Bug
> Components: Web connector
> Affects Versions: ManifoldCF 2.17
> Reporter: DK
> Priority: Major
>
> Trying to index sitemap xml and web connector index the whole xml into solr.
> Please fix in version 2.17.
> If it is any special config that needs to be taken care, please add here and
> add in documentation to make it clear.
>
> Sitemap.xml:
> <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
> <sitemap>
> <loc>https://<url>/sitemap_1.xml</loc>
> <lastmod>2022-01-21T16:04:45Z</lastmod>
> </sitemap>
> </sitemapindex>
>
> sitemap_1.xml:
> <urlset>
> <url>
> <loc>https://<docurl></loc>
> <lastmod>2018-10-31T11:25:27Z</lastmod>
> </url>
> </urlset>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)