[jira] [Resolved] (CONNECTORS-1571) Web Crawler Connector checks different MIME type than it is sending down the pipeline

2019-08-23 Thread Markus Schuch (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Schuch resolved CONNECTORS-1571.
---
Resolution: Not A Problem

> Web Crawler Connector checks different MIME type than it is sending down the 
> pipeline
> -
>
>                 Key: CONNECTORS-1571
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1571
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.10
>            Reporter: Markus Schuch
>            Priority: Minor
>
> The Web Crawler Connector extracts the MIME type from the response 
> Content-Type header.
> Then it strips any {{charset=whatever_encoding}} parameter and asks the 
> pipeline, via {{activities.checkMimeTypeIndexable(contentType);}}, whether the 
> resulting MIME type (without the charset) should be ingested.
> When sending the actual {{RepositoryDocument}}, however, it sets the full MIME 
> type (with the charset) in the document. This is not a major bug, but a small 
> inconsistency: the HttpPoster of the Solr Output Connector performs a "hard" 
> check of the MIME type again, which can have a different outcome than the 
> preceding check activity.
> I think this was introduced or (better) revealed with CONNECTORS-1482.
> Example:
> - In my scenario a crawled webpage has Content-Type {{text/html; charset=utf-8}}
> - {{activities.checkMimeTypeIndexable(contentType);}} is called with {{text/html}}
> - the hard check performed by the Solr Connector is called with {{text/html; charset=utf-8}}
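
The mismatch described in the report can be reproduced in isolation. The following is a minimal sketch, not ManifoldCF code: the class name, the allow-list, and the helpers stripParameters and hardCheck are hypothetical stand-ins, and the exact-match check is only an assumption about what a "hard" check might look like.

```java
import java.util.Set;

public class MimeTypeMismatchSketch {
    // Hypothetical allow-list standing in for the MIME types an output accepts.
    static final Set<String> ACCEPTED = Set.of("text/html", "text/plain");

    // Drops everything from the first ';' onward, e.g. "; charset=utf-8".
    static String stripParameters(String contentType) {
        int semi = contentType.indexOf(';');
        String base = (semi == -1) ? contentType : contentType.substring(0, semi);
        return base.trim();
    }

    // Exact-match check (assumed here to illustrate a "hard" check).
    static boolean hardCheck(String mimeType) {
        return ACCEPTED.contains(mimeType);
    }

    public static void main(String[] args) {
        String header = "text/html; charset=utf-8";
        // Early check sees the stripped value: prints true
        System.out.println(hardCheck(stripParameters(header)));
        // Later check sees the full header value: prints false
        System.out.println(hardCheck(header));
    }
}
```

The same header value passes one check and fails the other only because the two checks receive different strings.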



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (CONNECTORS-1571) Web Crawler Connector checks different MIME type than it is sending down the pipeline

2019-08-23 Thread Markus Schuch (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914580#comment-16914580
 ] 

Markus Schuch edited comment on CONNECTORS-1571 at 8/23/19 8:15 PM:


This was fixed with CONNECTORS-1621. The Solr Connector no longer checks the 
MIME type with parameters, so this is no longer a problem.


was (Author: schuchm):
This was fixed with CONNECTORS-1621. The Solr Connector no longer checks the 
mime type with parameters.
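
For illustration, a check that ignores MIME type parameters could look like the sketch below. This is not the change made for CONNECTORS-1621; the class and the methods baseType and accepts are hypothetical names, and the lower-casing step is an added assumption (MIME type names are case-insensitive).

```java
import java.util.Locale;
import java.util.Set;

public class BaseMimeTypeSketch {
    // Reduces a Content-Type value to its bare type/subtype, lower-cased.
    // Returns null for null input.
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semi = contentType.indexOf(';');
        String base = (semi == -1) ? contentType : contentType.substring(0, semi);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Compares only the bare type against the allow-list, so parameters
    // such as charset cannot change the outcome.
    static boolean accepts(Set<String> accepted, String contentType) {
        String base = baseType(contentType);
        return base != null && accepted.contains(base);
    }

    public static void main(String[] args) {
        Set<String> accepted = Set.of("text/html");
        System.out.println(accepts(accepted, "text/html"));                // true
        System.out.println(accepts(accepted, "text/html; charset=utf-8")); // true
        System.out.println(accepts(accepted, "application/pdf"));          // false
    }
}
```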



[jira] [Assigned] (CONNECTORS-1571) Web Crawler Connector checks different MIME type than it is sending down the pipeline

2019-08-23 Thread Markus Schuch (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Schuch reassigned CONNECTORS-1571:
-

Assignee: Markus Schuch



[jira] [Commented] (CONNECTORS-1571) Web Crawler Connector checks different MIME type than it is sending down the pipeline

2019-08-23 Thread Markus Schuch (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914580#comment-16914580
 ] 

Markus Schuch commented on CONNECTORS-1571:
---

This was fixed with CONNECTORS-1621. The Solr Connector no longer checks the 
MIME type with parameters.



[jira] [Resolved] (CONNECTORS-1621) Fix for CONNECTORS-1482 broke Solr / Tika integration

2019-08-23 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1621.
-
Resolution: Fixed

r1865744


> Fix for CONNECTORS-1482 broke Solr / Tika integration
> -
>
>                 Key: CONNECTORS-1621
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1621
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Lucene/SOLR connector
>    Affects Versions: ManifoldCF 2.13
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Major
>             Fix For: ManifoldCF 2.14
>
>
> When you use ManifoldCF with Tika extraction and Solr indexing via the Update 
> handler, all documents except text documents get rejected by the Solr 
> connector.





[jira] [Created] (CONNECTORS-1621) Fix for CONNECTORS-1482 broke Solr / Tika integration

2019-08-23 Thread Karl Wright (Jira)
Karl Wright created CONNECTORS-1621:
---

             Summary: Fix for CONNECTORS-1482 broke Solr / Tika integration
                 Key: CONNECTORS-1621
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1621
             Project: ManifoldCF
          Issue Type: Bug
          Components: Lucene/SOLR connector
    Affects Versions: ManifoldCF 2.13
            Reporter: Karl Wright
            Assignee: Karl Wright
             Fix For: ManifoldCF 2.14


When you use ManifoldCF with Tika extraction and Solr indexing via the Update 
handler, all documents except text documents get rejected by the Solr connector.


