[ 
https://issues.apache.org/jira/browse/CONNECTORS-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482114#comment-17482114
 ] 

Karl Wright commented on CONNECTORS-1695:
-----------------------------------------

The interestingMimeTypes are the mime types of documents that may be HTML or
XHTML, or of documents from which links cannot be extracted but whose content
can still be indexed.  What happens to a given document depends on what it
actually contains rather than on what its mime type says.  The mime types are
only a crude filter.

The code that we'd want to modify would be the extractLinks code:

{code}
            // Now, extract links.
            // We'll call the "link extractor" series, so we can plug more stuff in over time.
            boolean indexDocument = extractLinks(documentIdentifier,activities,filter);
{code}

The code for this method is:

{code}
  /** Code to extract links from an already-fetched document. */
  protected boolean extractLinks(String documentIdentifier, IProcessActivity activities, DocumentURLFilter filter)
    throws ManifoldCFException, ServiceInterruption
  {
    ProcessActivityRedirectionHandler redirectHandler = new ProcessActivityRedirectionHandler(documentIdentifier,activities,filter);
    handleRedirects(documentIdentifier,redirectHandler);
    if (Logging.connectors.isDebugEnabled() && redirectHandler.shouldIndex() == false)
      Logging.connectors.debug("Web: Not indexing document '"+documentIdentifier+"' because of redirection");
    // For html, we don't want any actions, because we don't do form submission.
    ProcessActivityHTMLHandler htmlHandler = new ProcessActivityHTMLHandler(documentIdentifier,activities,filter,metaRobotsTagsUsage);
    handleHTML(documentIdentifier,htmlHandler);
    if (Logging.connectors.isDebugEnabled() && htmlHandler.shouldIndex() == false)
      Logging.connectors.debug("Web: Not indexing document '"+documentIdentifier+"' because of HTML robots or content tags prohibiting indexing");
    ProcessActivityXMLHandler xmlHandler = new ProcessActivityXMLHandler(documentIdentifier,activities,filter);
    handleXML(documentIdentifier,xmlHandler);
    if (Logging.connectors.isDebugEnabled() && xmlHandler.shouldIndex() == false)
      Logging.connectors.debug("Web: Not indexing document '"+documentIdentifier+"' because of XML robots or content tags prohibiting indexing");
    // May add more later for other extraction tasks.
    return htmlHandler.shouldIndex() && redirectHandler.shouldIndex() && xmlHandler.shouldIndex();
  }
{code}

Note that three different parsing attempts are made: redirection pages, HTML,
and XML (which, I believe, covers only RSS feeds at this point).  You could add
a fourth.
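
To make the shape of a fourth attempt concrete, here is a minimal,
self-contained sketch.  Everything in it is hypothetical:
ExtractionHandlerSketch, SitemapHandlerSketch and ExtractLinksSketch are
illustrative names, not ManifoldCF classes, and the idea that a sitemap handler
should veto indexing is an assumption drawn from the bug report, not existing
connector behaviour.  The only thing taken from the method above is the
pattern: run every handler, then index only if all of them return true from
shouldIndex().

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of the "run all handlers, AND the results" pattern. */
final class ExtractLinksSketch {
  /** Runs every handler, then indexes only if none of them objected. */
  static boolean extractLinks(String documentText, List<ExtractionHandlerSketch> handlers) {
    boolean index = true;
    for (ExtractionHandlerSketch handler : handlers) {
      handler.handle(documentText);
      // Evaluate shouldIndex() for every handler, as the real method does.
      index = handler.shouldIndex() && index;
    }
    return index;
  }

  public static void main(String[] args) {
    List<ExtractionHandlerSketch> handlers = List.of(new SitemapHandlerSketch());
    String sitemap = "<urlset><url><loc>https://example.com/</loc></url></urlset>";
    String page = "<html><body>hello</body></html>";
    System.out.println("index sitemap? " + extractLinks(sitemap, handlers));
    System.out.println("index page?    " + extractLinks(page, handlers));
  }
}

/** Hypothetical stand-in for a per-format handler; not a ManifoldCF type. */
interface ExtractionHandlerSketch {
  /** Inspect the fetched document and record anything of interest. */
  void handle(String documentText);

  /** False if this handler concluded the document should not be indexed. */
  boolean shouldIndex();
}

/** Assumption: a sitemap is only a list of links, so it is followed but not indexed. */
final class SitemapHandlerSketch implements ExtractionHandlerSketch {
  private boolean sawSitemap = false;

  @Override
  public void handle(String documentText) {
    // Crude detection by root tag name; a real handler would rely on the parser.
    String lower = documentText.toLowerCase(Locale.ROOT);
    sawSitemap = lower.contains("<urlset") || lower.contains("<sitemapindex");
  }

  @Override
  public boolean shouldIndex() {
    return !sawSitemap;
  }
}
```

In the connector itself the fourth handler would presumably be constructed and
invoked next to the existing three, and its shouldIndex() result added to the
final return expression.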

Most of these invoke the fuzzyml parser, which is a bottom-up parser with
overrides for specific interesting tags.  Even though sitemap xml is supposedly
well formed, you wouldn't want to bet on it, and the fuzzyml parser is a
reasonable choice for parsing of this kind since it is quite resilient against
syntax errors of all kinds.

So the trick would be to identify the tag structure of a sitemap document and 
extend the overrides present for the parser the web connector is using to 
understand that xml syntax IN ADDITION TO the xhtml it already understands.
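
For reference, the tag structure in question is small: in both a sitemap index
(sitemapindex/sitemap) and a URL set (urlset/url), the link itself sits in a
loc element.  The sketch below shows the extraction step in isolation.  It
deliberately does NOT use the fuzzyml parser, whose API is not shown here; a
tolerant regular expression stands in for it, and SitemapLocSketch and
extractLocs are hypothetical names.  Treat it as an illustration of which tags
matter, not as the way the connector should be extended.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of ManifoldCF: pulls loc URLs out of sitemap text. */
final class SitemapLocSketch {
  // Matches <loc>...</loc>, tolerating a namespace prefix, attributes,
  // mixed case, and line breaks inside the element.
  private static final Pattern LOC = Pattern.compile(
    "<(?:[A-Za-z0-9_-]+:)?loc\\b[^>]*>(.*?)</(?:[A-Za-z0-9_-]+:)?loc\\s*>",
    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  /** Returns every non-empty loc value, in document order. */
  static List<String> extractLocs(String sitemapText) {
    List<String> urls = new ArrayList<>();
    Matcher matcher = LOC.matcher(sitemapText);
    while (matcher.find()) {
      // Only the ampersand entity is unescaped here; other entities are left alone.
      String url = matcher.group(1).trim().replace("&amp;", "&");
      if (!url.isEmpty())
        urls.add(url);
    }
    return urls;
  }

  public static void main(String[] args) {
    String index =
      "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
      + "<sitemap><loc>https://example.com/sitemap_1.xml</loc>"
      + "<lastmod>2022-01-21T16:04:45Z</lastmod></sitemap></sitemapindex>";
    System.out.println(extractLocs(index));
  }
}
```

The sketch does not distinguish child sitemaps from page URLs, on the
assumption that both would simply be queued as links to fetch.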


> Sitemap xml not detected in version 2.17 webconnector
> -----------------------------------------------------
>
>                 Key: CONNECTORS-1695
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1695
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.17
>            Reporter: DK
>            Priority: Major
>
> Trying to index sitemap xml and web connector index the whole xml into solr.
> Please fix in version 2.17.
> If it is any special config that needs to be taken care, please add here and 
> add in documentation to make it clear.
>  
> Sitemap.xml:
> <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
> <sitemap>
> <loc>https://<url>/sitemap_1.xml</loc>
> <lastmod>2022-01-21T16:04:45Z</lastmod>
> </sitemap>
> </sitemapindex>
>  
> sitemap_1.xml:
> <urlset>
> <url>
> <loc>https://<docurl></loc>
> <lastmod>2018-10-31T11:25:27Z</lastmod>
> </url>
> </urlset>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
