Hi Karl,

what do I have to do to make Tika declare the extracted plain text with MIME type text/plain in my setup?
As I said, I have a Tika extractor in place:

Pipeline:
1) Webcrawler Connector (Repository Connection)
2) Tika Extractor (Transformation)
3) Solr Connector (Output Connection, Extracting Update Handler disabled)

This transformer does not set the document's MIME type to "text/plain" via RepositoryDocument.setMimeType(). It only asks the downstream pipeline whether text/plain is indexable, and then sends the extracted text along with the original MIME type in my setup.

My output connection:
https://gist.github.com/schuch/e8e00d22467552bf5b9354946291f15d

My job/pipeline configuration:
https://gist.github.com/schuch/207e3a9f8de8f8481e9dbcdb69ebca5b

History screenshot attached (hope that works on mailing lists...)

My MCF version is trunk (r1865689).

Markus

On 23.08.2019 at 01:17, Karl Wright wrote:
> Hi Markus,
>
> If you use the straight update handler, with no Tika filter, then the
> Solr Connector by design restricts input to textual documents. We can
> perhaps broaden that to web pages but then you will be indexing HTML
> tags as well and I rather doubt that's what you want.
>
> If you run Tika within ManifoldCF, the mime type it presents to the
> update handler is text/plain.
> If you run via the extracting update handler, then there is no content
> type check done by the Solr connector.
>
> Karl
>
>
> On Thu, Aug 22, 2019 at 5:44 PM Markus Schuch <markus_sch...@web.de>
> wrote:
>
>     Hi,
>
>     I am playing around with the solrj mode of the Solr output connector, to
>     avoid running Tika extraction in Solr.
>
>     My problem is that the ingestion of web pages gets rejected with the
>     message
>
>     "Solr connector rejected document due to mime type restrictions:
>     (text/html; charset=UTF-8)"
>
>     My pipeline looks like this:
>
>     1) Webcrawler Connector (Repository Connection)
>     2) Tika Extractor (Transformation)
>     3) Solr Connector (Output Connection)
>
>     The webserver returns content type "text/html; charset=UTF-8" for
>     the pages.
>
>     The "Use extracting request handler" option is disabled in the Solr
>     output connection.
>
>     The mimetype inclusions in the Solr output connector are:
>
>     text/plain;charset=utf-8
>     text/html
>     text/html; charset=UTF-8
>
>     I think the ingestion gets rejected by the HttpPoster, because it
>     performs a hard check that the mime type has to be a "text/plain*" type
>     (see acceptableMimeTypes in HttpPoster).
>
>     The TikaExtractor asks if the downstream pipeline accepts
>     "text/plain;charset=utf-8", as this is the result of the extraction. But
>     the sent RepositoryDocument still carries the original mimetype from
>     before the extraction.
>
>     Is this a bug or am I missing something?
>
>     Many thanks in advance
>     Markus
>
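
To make the behaviour discussed in this thread concrete, here is a minimal, self-contained sketch. It is not ManifoldCF code: the class MimeCheckSketch, the Doc holder, and all method names are hypothetical, the tag stripping is only a crude stand-in for Tika, and the prefix check merely assumes that the hard check in HttpPoster behaves like a "text/plain*" match as described above. It contrasts an extractor that keeps the original MIME type with one that replaces it after extraction.

```java
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in ManifoldCF.
class MimeCheckSketch {

    // Minimal stand-in for a document travelling through the pipeline.
    static final class Doc {
        final String mimeType;
        final String body;

        Doc(String mimeType, String body) {
            this.mimeType = mimeType;
            this.body = body;
        }
    }

    // Assumed shape of a hard "text/plain*" check in the output stage.
    static boolean acceptsPlainTextOnly(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        return mimeType.trim().toLowerCase(Locale.ROOT).startsWith("text/plain");
    }

    // Crude stand-in for text extraction: drops anything that looks like a tag.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "").trim();
    }

    // Behaviour as reported: body is extracted, MIME type is left untouched.
    static Doc extractKeepingMimeType(Doc in) {
        return new Doc(in.mimeType, stripTags(in.body));
    }

    // Behaviour as expected: body is extracted and MIME type is updated.
    static Doc extractSettingMimeType(Doc in) {
        return new Doc("text/plain;charset=utf-8", stripTags(in.body));
    }

    public static void main(String[] args) {
        Doc page = new Doc("text/html; charset=UTF-8", "<p>hello</p>");

        // Original MIME type survives extraction, so the check fails: prints false
        System.out.println(acceptsPlainTextOnly(extractKeepingMimeType(page).mimeType));

        // MIME type replaced after extraction, so the check passes: prints true
        System.out.println(acceptsPlainTextOnly(extractSettingMimeType(page).mimeType));
    }
}
```

Under these assumptions, adding text/html to the MIME type inclusions would not help, because the rejection depends on the MIME type carried by the document itself rather than on the inclusion list.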