[ 
https://issues.apache.org/jira/browse/CONNECTORS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Schuch updated CONNECTORS-1571:
--------------------------------------
    Description: 
The Web Crawler Connector extracts the MIME type from the response Content-Type 
header.
It then strips any {{charset=whatever_encoding}} parameter and asks the 
pipeline, via {{activities.checkMimeTypeIndexable(contentType);}}, whether the 
resulting MIME type (without the charset) should be ingested.

When sending the actual {{RepositoryDocument}}, however, it sets the full MIME 
type (with the charset) on the document. This is not a major bug, but a small 
inconsistency: the HttpPoster of the Solr Output Connector performs a second, 
"hard" check of the MIME type, which can have a different outcome than the 
preceding check activity.

I think this was introduced, or rather revealed, by CONNECTORS-1482.

Example:
- In my scenario, a crawled web page has the Content-Type 
{{text/html; charset=utf-8}}
- {{activities.checkMimeTypeIndexable(contentType);}} is called with 
{{text/html}}
- The hard check performed by the Solr Output Connector is called with 
{{text/html; charset=utf-8}}
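
To illustrate the mismatch, here is a minimal Java sketch. It is not ManifoldCF 
code: the class {{MimeTypeCheckSketch}}, its methods {{stripParameters}} and 
{{hardCheck}}, and the allow-list are hypothetical, and it assumes the hard 
check is an exact match against a set of accepted MIME types.

```java
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; not taken from the ManifoldCF source.
class MimeTypeCheckSketch {

    // Hypothetical allow-list standing in for whatever the output connector accepts.
    static final Set<String> ACCEPTED = Set.of("text/html", "text/plain");

    // Drop any parameters (such as "; charset=utf-8") from a Content-Type value.
    static String stripParameters(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semi = contentType.indexOf(';');
        String bare = (semi == -1) ? contentType : contentType.substring(0, semi);
        return bare.trim().toLowerCase(Locale.ROOT);
    }

    // Hypothetical stand-in for an exact-match ("hard") MIME type check.
    static boolean hardCheck(String mimeType) {
        return mimeType != null && ACCEPTED.contains(mimeType.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String header = "text/html; charset=utf-8";

        // Value used for the early check: parameters stripped.
        String checked = stripParameters(header);
        System.out.println(hardCheck(checked)); // prints "true"

        // Value set on the document: full header, charset included.
        System.out.println(hardCheck(header));  // prints "false"
    }
}
```

Under these assumptions, passing the same stripped value to both checks (or 
setting it on the document) would make the two outcomes agree.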


> Web Crawler Connector checks different MIME type than it is sending down the 
> pipeline
> -------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1571
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1571
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.10
>            Reporter: Markus Schuch
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
