There are two possible ways to configure Tika with Solr.
First way: Tika extractor + Solr update handler
Second way: no Tika extractor + Solr update/extract handler
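
In pipeline terms (taking the web crawl from this thread as the example),
the two setups look roughly like this:

    First way:   Web Connector -> Tika Extractor -> Solr Connector (handler "update")
    Second way:  Web Connector -> Solr Connector (handler "update/extract")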

For the first way, the Solr Connector completely ignores any "accepted mime
types" you set for it and accepts only text/plain.  For the second way,
what you set in "accepted mime types" is used to filter the documents
being crawled and indexed.  Note that you NEVER include the charset in the
mime type you specify; the charset is supposed to be stripped off by
whatever passes the mime type between connectors.
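
To illustrate the stripping, here is a minimal sketch in Java (illustration
only, not the actual ManifoldCF code): drop any "; charset=..." parameter
before the mime type is compared or handed to the next connector.

    // Illustration only -- not the real ManifoldCF implementation.
    public class MimeTypeNormalizer {

        /** Strip any parameters (e.g. "; charset=UTF-8") from a mime type. */
        public static String stripParameters(String mimeType) {
            if (mimeType == null) {
                return null;
            }
            int semi = mimeType.indexOf(';');
            String base = (semi >= 0) ? mimeType.substring(0, semi) : mimeType;
            return base.trim().toLowerCase();
        }

        public static void main(String[] args) {
            // "text/html; charset=UTF-8" (as returned by a webserver) prints as "text/html"
            System.out.println(stripParameters("text/html; charset=UTF-8"));
        }
    }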

Both of these configurations have been used extensively by many others.

So it sounds to me like what you need to do is switch to the Solr update
handler.  That means not just unchecking the box, but also entering
"update" rather than "update/extract" in the handler field.

If you keep using the update/extract handler, you are essentially invoking
Tika twice, which is why we don't really support that option very well.
But you should be able to just have it accept "text/plain" and it should
work.  Or uncheck the box, in which case it defaults to allowing
"text/plain" with no other options accepted.

Karl


On Fri, Aug 23, 2019 at 2:17 AM Markus Schuch <markus_sch...@web.de> wrote:

> Hi Karl,
>
> What do I have to do to make Tika declare the extracted plain text with
> mime type text/plain in my setup?
>
> As I said, I have a Tika extractor in place:
>
>     Pipeline:
>     1) Webcrawler Connector (Repository Connection)
>     2) Tika Extractor (Transformation)
>     3) Solr Connector (Output Connection,
>                        Extracting Update Handler disabled)
>
> This transformer does not call RepositoryDocument.setMimeType() with the
> value "text/plain". It just asks the downstream pipeline whether
> text/plain is indexable, but it then sends the extracted text along with
> the original mime type in my setup.
>
> My output connection:
> https://gist.github.com/schuch/e8e00d22467552bf5b9354946291f15d
>
> My job/pipeline configuration:
> https://gist.github.com/schuch/207e3a9f8de8f8481e9dbcdb69ebca5b
>
> History screenshot attached (hope that works on mailing lists...)
>
> My MCF Version is trunk (r1865689)
>
> Markus
>
>
> Am 23.08.2019 um 01:17 schrieb Karl Wright:
> > Hi Markus,
> >
> > If you use the straight update handler, with no Tika filter, then the
> > Solr Connector by design restricts input to textual documents.  We can
> > perhaps broaden that to web pages but then you will be indexing HTML
> > tags as well and I rather doubt that's what you want.
> >
> > If you run Tika within ManifoldCF, the mime type it presents to the
> > update handler is text/plain.
> > If you run via the extracting update handler, then there is no content
> > type check done by the Solr connector.
> >
> > Karl
> >
> >
> > On Thu, Aug 22, 2019 at 5:44 PM Markus Schuch <markus_sch...@web.de> wrote:
> >
> >     Hi,
> >
> >     I am playing around with the SolrJ mode of the Solr output
> >     connector, to avoid running Tika extraction in Solr.
> >
> >     My problem is that the ingestion of web pages gets rejected with the
> >     message
> >
> >         "Solr connector rejected document due to mime type restrictions:
> >         (text/html; charset=UTF-8)"
> >
> >     My pipeline looks like this:
> >
> >         1) Webcrawler Connector (Repository Connection)
> >         2) Tika Extractor (Transformation)
> >         3) Solr Connector (Output Connection)
> >
> >     The webserver returns content type "text/html; charset=UTF-8" for
> >     the pages.
> >
> >     The "Use extracting request handler" option is disabled in the Solr
> >     output connection.
> >
> >     The mime type inclusions in the Solr output connector are:
> >
> >         text/plain;charset=utf-8
> >         text/html
> >         text/html; charset=UTF-8
> >
> >     I think the ingestion gets rejected by the HttpPoster, because it
> >     performs a hard check that the mime type has to be a "text/plain*"
> >     type (see acceptableMimeTypes in HttpPoster).
> >
> >     The TikaExtractor asks whether the downstream pipeline accepts
> >     "text/plain;charset=utf-8", as this is the result of the extraction.
> >     But the sent RepositoryDocument still carries the original mime type
> >     from before the extraction.
> >
> >     Is this a bug, or am I missing something?
> >
> >     Many thanks in advance
> >     Markus
> >
>
