[jira] [Resolved] (CONNECTORS-1571) Web Crawler Connector checks different MIME type than it is sending down the pipeline
[ https://issues.apache.org/jira/browse/CONNECTORS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Schuch resolved CONNECTORS-1571.
---
    Resolution: Not A Problem

> Web Crawler Connector checks different MIME type than it is sending down the pipeline
> ---
>
> Key: CONNECTORS-1571
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1571
> Project: ManifoldCF
> Issue Type: Bug
> Components: Web connector
> Affects Versions: ManifoldCF 2.10
> Reporter: Markus Schuch
> Priority: Minor
>
> The Web Crawler Connector extracts the MIME type from the response Content-Type header.
> Then it truncates the possible {{charset=whatever_encoding}} and lets the pipeline check, via {{activities.checkMimeTypeIndexable(contentType);}}, whether the resulting MIME type (without the charset) should be ingested.
> When sending the actual {{RepositoryDocument}} it sets the full MIME type (with the charset) in the document. This is not a major bug, but a small inconsistency, since the HttpPoster of the Solr Output Connector performs a "hard" check of the MIME type again, which can have a different outcome than the preceding check activity.
> I think this was introduced or (better) revealed with CONNECTORS-1482.
> Example:
> - In my scenario a crawled webpage has Content-Type {{text/html; charset=utf-8}}
> - the {{activities.checkMimeTypeIndexable(contentType);}} is called with {{text/html}}
> - the hard check performed by the Solr Connector is called with {{text/html; charset=utf-8}}

--
This message was sent by Atlassian Jira (v8.3.2#803003)
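The mismatch described in this issue can be reproduced in isolation. The following is a minimal sketch, not ManifoldCF or Solr connector code: the class {{MimeTypeCheckSketch}}, its methods, and the {{ACCEPTED}} allow-list are all hypothetical stand-ins, assumed only for illustration. It shows how a pre-check on the parameter-stripped type and a later exact-match check on the full header value can disagree, and how a check that ignores parameters avoids that.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class MimeTypeCheckSketch {
    // Hypothetical allow-list; a real output connector would read this from its configuration.
    static final Set<String> ACCEPTED =
        new HashSet<>(Arrays.asList("text/html", "text/plain"));

    // Drops parameters such as "; charset=utf-8" and lower-cases the bare type.
    static String stripParameters(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // A "hard" check: compares the string exactly as given, parameters included.
    static boolean exactCheck(String mimeType) {
        return mimeType != null && ACCEPTED.contains(mimeType);
    }

    // A check that ignores parameters before comparing.
    static boolean parameterAgnosticCheck(String mimeType) {
        return exactCheck(stripParameters(mimeType));
    }

    public static void main(String[] args) {
        String header = "text/html; charset=utf-8";
        // Pre-check on the stripped type, as in the crawler-side check described above.
        System.out.println(exactCheck(stripParameters(header))); // prints "true"
        // Later hard check on the full value that was stored in the document.
        System.out.println(exactCheck(header));                  // prints "false"
        // Stripping parameters on the output side makes both checks agree.
        System.out.println(parameterAgnosticCheck(header));      // prints "true"
    }
}
```

Either side could be made consistent: the crawler could store the stripped type in the document, or the output side could strip parameters before comparing. The sketch takes no position on which the project chose beyond what the comments on this issue state.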
[jira] [Comment Edited] (CONNECTORS-1571) Web Crawler Connector checks different MIME type than it is sending down the pipeline
[ https://issues.apache.org/jira/browse/CONNECTORS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914580#comment-16914580 ]

Markus Schuch edited comment on CONNECTORS-1571 at 8/23/19 8:15 PM:

This was fixed with CONNECTORS-1621. The Solr Connector no longer checks the mime type with parameters, so this is not a problem any more.

was (Author: schuchm):
This was fixed with CONNECTORS-1621. The Solr Connector no longer checks the mime type with parameters.
[jira] [Assigned] (CONNECTORS-1571) Web Crawler Connector checks different MIME type than it is sending down the pipeline
[ https://issues.apache.org/jira/browse/CONNECTORS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Schuch reassigned CONNECTORS-1571:
---
    Assignee: Markus Schuch
[jira] [Commented] (CONNECTORS-1571) Web Crawler Connector checks different MIME type than it is sending down the pipeline
[ https://issues.apache.org/jira/browse/CONNECTORS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914580#comment-16914580 ]

Markus Schuch commented on CONNECTORS-1571:
---
This was fixed with CONNECTORS-1621. The Solr Connector no longer checks the mime type with parameters.
[jira] [Resolved] (CONNECTORS-1621) Fix for CONNECTORS-1482 broke Solr / Tika integration
[ https://issues.apache.org/jira/browse/CONNECTORS-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-1621.
---
    Resolution: Fixed

r1865744

> Fix for CONNECTORS-1482 broke Solr / Tika integration
> ---
>
> Key: CONNECTORS-1621
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1621
> Project: ManifoldCF
> Issue Type: Bug
> Components: Lucene/SOLR connector
> Affects Versions: ManifoldCF 2.13
> Reporter: Karl Wright
> Assignee: Karl Wright
> Priority: Major
> Fix For: ManifoldCF 2.14
>
> When you use ManifoldCF with Tika extraction and Solr indexing via the Update handler, all documents except text documents get rejected by the Solr connector.
[jira] [Created] (CONNECTORS-1621) Fix for CONNECTORS-1482 broke Solr / Tika integration
Karl Wright created CONNECTORS-1621:
---

Summary: Fix for CONNECTORS-1482 broke Solr / Tika integration
Key: CONNECTORS-1621
URL: https://issues.apache.org/jira/browse/CONNECTORS-1621
Project: ManifoldCF
Issue Type: Bug
Components: Lucene/SOLR connector
Affects Versions: ManifoldCF 2.13
Reporter: Karl Wright
Assignee: Karl Wright
Fix For: ManifoldCF 2.14

When you use ManifoldCF with Tika extraction and Solr indexing via the Update handler, all documents except text documents get rejected by the Solr connector.