Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Markus Schuch
Hi Karl, yes, this helps. The webpage is now ingested after Tika extraction, and I only have to include the mime type text/html in the Solr output connection. Many thanks. Cheers, Markus On 23.08.2019 at 13:45, Karl Wright wrote: > Created a ticket: CONNECTORS-1621.  Added a fix.  Please let

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Karl Wright
Created a ticket: CONNECTORS-1621. Added a fix. Please let me know if it resolves the problem for you. Thanks, Karl On Fri, Aug 23, 2019 at 7:33 AM Karl Wright wrote: > Hi Markus, > > You are correct. > This code was added as part of > https://issues.apache.org/jira/browse/CONNECTORS-1482 .

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Karl Wright
Hi Markus, You are correct. This code was added as part of https://issues.apache.org/jira/browse/CONNECTORS-1482 . The code that was added does look at the content mime type. The reason the mime type is not modified in the document that Tika passes on to Solr is that we want Solr to

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Markus Schuch
I already have "update" in the handler field. You can see that in the gist I linked, and it is not working. The HttpPoster of the SolrConnector takes RepositoryDocument.getMimeType() and checks that mime type against the hardcoded list of plain-text mime types, if Solr Cell mode (extracting

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Karl Wright
There are two possible ways to configure Tika with Solr. First way: Tika extractor + Solr update handler. Second way: no Tika extractor + Solr update/extract handler. For the first way, the Solr Connector completely ignores any "accepted mime types" you set for it, and only accepts text/plain. For
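The acceptance rule described in this message can be sketched roughly as follows. This is only an illustration, not the connector's actual code: the class `MimeTypeGate`, its method names, the contents of the hardcoded list, the parameter stripping, and the behaviour of the extract-handler branch are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; none of these names come from the ManifoldCF source.
final class MimeTypeGate {

    // Assumed stand-in for the hardcoded plain-text list mentioned in the thread.
    private static final Set<String> PLAIN_TEXT_TYPES = Set.of("text/plain");

    // Drops parameters such as "; charset=UTF-8" and lowercases the base type.
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return "";
        }
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // First way (plain update handler): the configured list is ignored and only
    // the hardcoded plain-text types pass.
    // Second way (extract handler): assumed here to consult the configured list.
    static boolean isAccepted(String mimeType, boolean useExtractHandler,
                              Set<String> configuredTypes) {
        String base = normalize(mimeType);
        if (useExtractHandler) {
            return configuredTypes.contains(base);
        }
        return PLAIN_TEXT_TYPES.contains(base);
    }

    public static void main(String[] args) {
        Set<String> configured = Set.of("text/html");
        // A crawled web page keeps its original mime type after Tika extraction,
        // so with the update handler it fails the check even if text/html is configured.
        System.out.println(isAccepted("text/html", false, configured));
        System.out.println(isAccepted("text/plain; charset=UTF-8", false, configured));
    }
}
```

Under these assumptions, a web page that still carries text/html after Tika extraction is rejected in the first configuration, which matches the symptom reported in this thread.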

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Markus Schuch
Hi Karl, what do I have to do to make Tika declare the extracted plain text with mime type text/plain in my setup? As I said, I have a Tika extractor in place. Pipeline: 1) Webcrawler Connector (Repository Connection) 2) Tika Extractor (Transformation) 3) Solr Connector (Output