[ https://issues.apache.org/jira/browse/CONNECTORS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Schuch reassigned CONNECTORS-1571:
-----------------------------------------

    Assignee: Markus Schuch

> Web Crawler Connector checks different MIME type than it is sending down the pipeline
> -------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1571
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1571
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.10
>            Reporter: Markus Schuch
>            Assignee: Markus Schuch
>            Priority: Minor
>
> The Web Crawler Connector extracts the MIME type from the response
> Content-Type header.
> It then strips any {{charset=whatever_encoding}} parameter and asks the
> pipeline, via {{activities.checkMimeTypeIndexable(contentType);}}, whether
> the resulting MIME type (without the charset) should be ingested.
> When sending the actual {{RepositoryDocument}}, however, it sets the full
> MIME type (with the charset) in the document.
> This is not a major bug, but it is a small inconsistency: the HttpPoster of
> the Solr Output Connector performs a "hard" check of the MIME type again,
> and that check can have a different outcome than the preceding check
> activity.
> I think this was introduced or (better) revealed with CONNECTORS-1482.
> Example:
> - In my scenario, a crawled webpage has Content-Type {{text/html; charset=utf-8}}
> - {{activities.checkMimeTypeIndexable(contentType);}} is called with {{text/html}}
> - the hard check performed by the Solr Connector is called with {{text/html; charset=utf-8}}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
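
The mismatch described in the issue can be illustrated with a small standalone sketch. This is not ManifoldCF code: the class {{MimeTypeCheckSketch}}, the helper {{stripParameters}}, the allow-list {{ALLOWED}}, and the exact-match behaviour of {{isAllowed}} are all hypothetical stand-ins, assumed only to show how checking the stripped value early and the full header value later can disagree.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class MimeTypeCheckSketch {
    // Hypothetical allow-list standing in for an output connector's configured MIME types.
    static final Set<String> ALLOWED =
        new HashSet<>(Arrays.asList("text/html", "text/plain"));

    // Drop any parameters such as "; charset=utf-8" and normalise case.
    static String stripParameters(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Exact-match lookup, assumed here as a stand-in for a "hard" MIME type check.
    static boolean isAllowed(String mimeType) {
        return mimeType != null && ALLOWED.contains(mimeType);
    }

    public static void main(String[] args) {
        String header = "text/html; charset=utf-8";
        String stripped = stripParameters(header);

        // Early check sees the stripped value.
        System.out.println(isAllowed(stripped)); // true
        // Later check sees the full header value, so an exact match fails.
        System.out.println(isAllowed(header));   // false
    }
}
```

Under these assumptions, setting the same stripped value on the document that was used for the earlier check would make both checks see the same MIME type and agree.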