Hi,

I am playing around with the SolrJ mode of the Solr output connector to
avoid running Tika extraction in Solr.

My problem is that the ingestion of web pages gets rejected with the
message

    "Solr connector rejected document due to mime type restrictions:
    (text/html; charset=UTF-8)"

My pipeline looks like this:

    1) Webcrawler Connector (Repository Connection)
    2) Tika Extractor (Transformation)
    3) Solr Connector (Output Connection)

The webserver returns content type "text/html; charset=UTF-8" for the pages.

The "Use extracting request handler" option is disabled in the Solr
output connection.

The MIME type inclusions in the Solr output connector are:

    text/plain;charset=utf-8
    text/html
    text/html; charset=UTF-8

I think the ingestion gets rejected by the HttpPoster, because it
performs a hard check that the mime type has to be a "text/plain*" type
(see acceptableMimeTypes in HttpPoster).

The TikaExtractor asks whether the downstream pipeline accepts
"text/plain;charset=utf-8", since this is the result of the extraction.
But the RepositoryDocument that is actually sent still carries the
original MIME type from before the extraction.
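
To make the suspected mismatch concrete, here is a minimal sketch. It is
not the actual HttpPoster code: the class name MimeCheckSketch and the
method acceptsTextPlainOnly are made up for illustration, and I am only
assuming that the real check behaves roughly like a "text/plain*" prefix
test.

```java
import java.util.Locale;

public class MimeCheckSketch {
    // Hypothetical stand-in for a hard "text/plain*" check.
    // This is an assumption about the behavior, not code copied from HttpPoster.
    public static boolean acceptsTextPlainOnly(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        return mimeType.trim().toLowerCase(Locale.ROOT).startsWith("text/plain");
    }

    public static void main(String[] args) {
        // What the TikaExtractor asks the downstream pipeline about.
        String askedByExtractor = "text/plain;charset=utf-8";
        // What the RepositoryDocument still carries when it reaches the output.
        String carriedByDocument = "text/html; charset=UTF-8";

        System.out.println(acceptsTextPlainOnly(askedByExtractor));  // prints "true"
        System.out.println(acceptsTextPlainOnly(carriedByDocument)); // prints "false"
    }
}
```

If the real check works like this, the pre-flight question succeeds but
the document itself is rejected, which would match the message above.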

Is this a bug, or am I missing something?

Many thanks in advance
Markus
