Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

Karl Wright Fri, 23 Aug 2019 04:34:09 -0700

Hi Markus,

You are correct.
This code was added as part of
https://issues.apache.org/jira/browse/CONNECTORS-1482 .  The code that was
added does look at the content mime type.


The mime type is not modified in the document that Tika passes on to Solr
because we want Solr to receive the original mime type, which may be of
interest at indexing time.  So a filter specified in the Solr connector
should always be applied against the original mime type and not the
modified one.
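
As a rough sketch of that rule (this is not the actual connector code; the
class and method names below are hypothetical and exist only for
illustration), a filter could strip any charset parameter and then compare
the original mime type against the accepted list:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; none of these names are ManifoldCF APIs.
public class MimeTypeFilterSketch {

    // Drop parameters such as "; charset=UTF-8" and lower-case the base type.
    public static String baseMimeType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Accept a document when its original (pre-extraction) base type is allowed.
    public static boolean isAccepted(String originalContentType, Set<String> acceptedBaseTypes) {
        String base = baseMimeType(originalContentType);
        return base != null && acceptedBaseTypes.contains(base);
    }

    public static void main(String[] args) {
        // Assumed accepted list, written without charset parameters.
        Set<String> accepted = new HashSet<>(Arrays.asList("text/plain", "text/html"));
        System.out.println(isAccepted("text/html; charset=UTF-8", accepted));
        System.out.println(isAccepted("application/pdf", accepted));
    }
}
```

With the assumed list above, a page served as "text/html; charset=UTF-8"
would pass the filter even after Tika has extracted plain text from it.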

Let me make that change.

Karl


On Fri, Aug 23, 2019 at 6:31 AM Markus Schuch <[email protected]> wrote:

> I already have "update" in the handler field, as one can see in the
> gist link I posted, and it is still not working.
>
> The HttpPoster of the SolrConnector takes
> RepositoryDocument.getMimeType() and checks the mime type against the
> hardcoded plain text mime type list, if solr cell mode (extracting
> request handler mode) is disabled.
>
> I think org.apache.manifoldcf.agents.transformation.tika.TikaExtractor
> never calling setMimeType on the duplicated RepositoryDocument to set
> the MIME type to text/plain might be the source of my problem.
>
> Markus
>
> On 23.08.2019 at 10:30, Karl Wright wrote:
> > There are two possible ways to configure Tika with Solr.
> > First way: Tika extractor + Solr update handler
> > Second way: no Tika extractor + Solr update/extract handler
> >
> > For the first way, the Solr Connector completely ignores any "accepted
> > mime types" you set for it, and only accepts text/plain.  For the second
> > way, what you set in the "accepted mime types" is used to filter out
> > what is being crawled.  You NEVER include the charset, by the way, in
> > the mime type you specify; that's supposed to get stripped off by anyone
> > who passes it between connectors.
> >
> > Both of these have been extensively used by many others.
> >
> > So it sounds to me like what you need to do is switch to the Solr
> > update handler.  That's not just unchecking the box, it is also
> > entering "update" rather than "update/extract" in the handler field.
> >
> > If you still use the update/extract handler, you are essentially
> > invoking Tika twice, which is why we don't really support this option
> > very well.  But you should be able to just have it accept "text/plain"
> > and it should work.  OR uncheck the box and it should just default to
> > allowing "text/plain" with no other options accepted.
> >
> > Karl
> >
> >
> > On Fri, Aug 23, 2019 at 2:17 AM Markus Schuch <[email protected]> wrote:
> >
> >     Hi Karl,
> >
> >     What do I have to do to make Tika declare the extracted plain text
> >     with mime type text/plain in my setup?
> >
> >     As I said, I have a Tika extractor in place:
> >
> >         Pipeline:
> >         1) Webcrawler Connector (Repository Connection)
> >         2) Tika Extractor (Transformation)
> >         3) Solr Connector (Output Connection,
> >                            Extracting Update Handler disabled)
> >
> >     This transformer does not call RepositoryDocument.setMimeType()
> >     with the value "text/plain". It just asks the downstream pipeline
> >     if text/plain is indexable, but it then sends the extracted text
> >     along with the original mime type in my setup.
> >
> >     My output connection:
> >     https://gist.github.com/schuch/e8e00d22467552bf5b9354946291f15d
> >
> >     My job/pipeline configuration:
> >     https://gist.github.com/schuch/207e3a9f8de8f8481e9dbcdb69ebca5b
> >
> >     History screenshot attached (hope that works on mailing lists...)
> >
> >     My MCF Version is trunk (r1865689)
> >
> >     Markus
> >
> >
> >     On 23.08.2019 at 01:17, Karl Wright wrote:
> >     > Hi Markus,
> >     >
> >     > If you use the straight update handler, with no Tika filter,
> >     > then the Solr Connector by design restricts input to textual
> >     > documents.  We can perhaps broaden that to web pages but then
> >     > you will be indexing HTML tags as well and I rather doubt
> >     > that's what you want.
> >     >
> >     > If you run Tika within ManifoldCF, the mime type it presents to the
> >     > update handler is text/plain.
> >     > If you run via the extracting update handler, then there is no
> >     > content type check done by the Solr connector.
> >     >
> >     > Karl
> >     >
> >     >
> >     > On Thu, Aug 22, 2019 at 5:44 PM Markus Schuch <[email protected]>
> >     > wrote:
> >     >
> >     >     Hi,
> >     >
> >     >     I am playing around with the SolrJ mode of the Solr output
> >     >     connector, to avoid running Tika extraction in Solr.
> >     >
> >     >     My problem is that the ingestion of web pages gets rejected
> >     >     with the message
> >     >
> >     >         "Solr connector rejected document due to mime type
> >     >         restrictions: (text/html; charset=UTF-8)"
> >     >
> >     >     My pipeline looks like this:
> >     >
> >     >         1) Webcrawler Connector (Repository Connection)
> >     >         2) Tika Extractor (Transformation)
> >     >         3) Solr Connector (Output Connection)
> >     >
> >     >     The webserver returns content type "text/html; charset=UTF-8"
> >     >     for the pages.
> >     >
> >     >     The "Use extracting request handler" option is disabled in the
> >     >     Solr output connection.
> >     >
> >     >     The mimetype inclusions in the solr output connector are:
> >     >
> >     >         text/plain;charset=utf-8
> >     >         text/html
> >     >         text/html; charset=UTF-8
> >     >
> >     >     I think the ingestion gets rejected by the HttpPoster, because
> >     >     it performs a hard check that the mime type has to be a
> >     >     "text/plain*" type (see acceptableMimeTypes in HttpPoster).
> >     >
> >     >     The TikaExtractor asks if the downstream pipeline accepts
> >     >     "text/plain;charset=utf-8", as this is the result of the
> >     >     extraction. But the RepositoryDocument that is sent still
> >     >     carries the original mime type from before the extraction.
> >     >
> >     >     Is this a bug or am I missing something?
> >     >
> >     >     Many thanks in advance
> >     >     Markus
> >     >
> >
>
