Hi Markus,

If you use the straight update handler, with no Tika filter, then the Solr
connector by design restricts input to textual documents. We could perhaps
broaden that to web pages, but then you would be indexing HTML tags as well,
and I rather doubt that's what you want.

If you run Tika within ManifoldCF, the mime type it presents to the update
handler is text/plain.
If you run via the extracting update handler, then there is no content type
check done by the Solr connector.
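
To make that concrete, here is a minimal sketch of the kind of gate involved
(illustrative names only, a simplified reading of what acceptableMimeTypes in
HttpPoster does, not the actual code):

    // Minimal sketch of the mime type gate described above; illustrative
    // names only, not the actual ManifoldCF/HttpPoster implementation.
    import java.util.Locale;
    import java.util.Set;

    public class MimeTypeGateSketch {

      // Without the extracting request handler, only plain text is let through.
      private static final Set<String> ACCEPTABLE = Set.of("text/plain");

      // Strip parameters such as "; charset=UTF-8" before comparing.
      static String baseType(String mimeType) {
        int semi = mimeType.indexOf(';');
        String base = semi >= 0 ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
      }

      static boolean indexable(String mimeType, boolean useExtractingHandler) {
        // With the extracting request handler there is no content type check at all.
        if (useExtractingHandler)
          return true;
        return ACCEPTABLE.contains(baseType(mimeType));
      }

      public static void main(String[] args) {
        System.out.println(indexable("text/html; charset=UTF-8", false)); // rejected
        System.out.println(indexable("text/plain", false));               // accepted
        System.out.println(indexable("text/html; charset=UTF-8", true));  // no check
      }
    }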

Karl


On Thu, Aug 22, 2019 at 5:44 PM Markus Schuch <markus_sch...@web.de> wrote:

> Hi,
>
> I am playing around with the SolrJ mode of the Solr output connector to
> avoid running Tika extraction in Solr.
>
> My problem is that the ingestion of web pages gets rejected with the
> message
>
>     "Solr connector rejected document due to mime type restrictions:
>     (text/html; charset=UTF-8)"
>
> My pipeline looks like this:
>
>     1) Webcrawler Connector (Repository Connection)
>     2) Tika Extractor (Transformation)
>     3) Solr Connector (Output Connection)
>
> The webserver returns content type "text/html; charset=UTF-8" for the
> pages.
>
> The "Use extracting request handler" option is disabled in the solr
> output connection.
>
> The mimetype inclusions in the solr output connector are:
>
>     text/plain;charset=utf-8
>     text/html
>     text/html; charset=UTF-8
>
> I think the ingestion gets rejected by the HttpPoster because it
> performs a hard check that the mime type has to be a "text/plain*" type
> (see acceptableMimeTypes in HttpPoster).
>
> The TikaExtractor asks whether the downstream pipeline accepts
> "text/plain;charset=utf-8", since that is the result of the extraction,
> but the RepositoryDocument that is sent still carries the original mime
> type from before the extraction.
>
> Is this a bug, or am I missing something?
>
> Many thanks in advance
> Markus
>
