Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

Karl Wright Fri, 23 Aug 2019 04:46:17 -0700

Created a ticket: CONNECTORS-1621.  Added a fix.  Please let me know if it
resolves the problem for you.


Thanks,
Karl


On Fri, Aug 23, 2019 at 7:33 AM Karl Wright <daddy...@gmail.com> wrote:

> Hi Markus,
>
> You are correct.
> This code was added as part of
> https://issues.apache.org/jira/browse/CONNECTORS-1482 .  The code that
> was added does look at the content mime type.
>
> Tika does not modify the mime type in the document being passed to Solr
> because we want Solr to receive the original mime type, which may be of
> interest at indexing time.  So a filter specified in the Solr connector
> should always be applied against the original mime type and not the
> modified one.
>
> Let me make that change.
>
> Karl
>
>
> On Fri, Aug 23, 2019 at 6:31 AM Markus Schuch <markus_sch...@web.de>
> wrote:
>
>> I already have "update" in the handler field. You can see that in the
>> gist I posted, and it is not working.
>>
>> When Solr Cell mode (extracting request handler mode) is disabled, the
>> HttpPoster of the SolrConnector takes RepositoryDocument.getMimeType()
>> and checks the mime type against the hardcoded plain text mime type list.
>>
>> I think the source of my problem might be that
>> org.apache.manifoldcf.agents.transformation.tika.TikaExtractor never
>> calls setMimeType on the duplicated RepositoryDocument to set the MIME
>> type to text/plain.
>>
>> Markus
>>
>> On 23.08.2019 at 10:30, Karl Wright wrote:
>> > There are two possible ways to configure Tika with Solr.
>> > First way: Tika extractor + Solr update handler
>> > Second way: no Tika extractor + Solr update/extract handler
>> >
>> > For the first way, the Solr Connector completely ignores any "accepted
>> > mime types" you set for it, and only accepts text/plain.  For the second
>> > way, what you set in the "accepted mime types" is used to filter out
>> > what is being crawled.  You NEVER include the charset, by the way, in
>> > the mime type you specify; that's supposed to get stripped off by anyone
>> > who passes it between connectors.
>> >
>> > Both of these have been extensively used by many others.
>> >
>> > So it sounds to me like what you need to do is change to the Solr
>> > update handler.  That's not just unchecking the box; it also means
>> > entering "update" rather than "update/extract" in the handler field.
>> >
>> > If you still use the update/extract handler, you are essentially
>> > invoking Tika twice, which is why we don't really support this option
>> > very well.  But you should be able to just have it accept "text/plain"
>> > and it should work.  OR uncheck the box and it should just default to
>> > allowing "text/plain" with no other options accepted.
>> >
>> > Karl
>> >
>> >
>> > On Fri, Aug 23, 2019 at 2:17 AM Markus Schuch <markus_sch...@web.de>
>> > wrote:
>> >
>> >     Hi Karl,
>> >
>> >     What do I have to do to make Tika declare the extracted plain text
>> >     with mime type text/plain in my setup?
>> >
>> >     As I said, I have a Tika extractor in place:
>> >
>> >         Pipeline:
>> >         1) Webcrawler Connector (Repository Connection)
>> >         2) Tika Extractor (Transformation)
>> >         3) Solr Connector (Output Connection,
>> >                            Extracting Update Handler disabled)
>> >
>> >     This transformer does not call RepositoryDocument.setMimeType()
>> >     with the value "text/plain". It just asks the downstream pipeline
>> >     whether text/plain is indexable, but it then sends the extracted
>> >     text along with the original mime type in my setup.
>> >
>> >     My output connection:
>> >     https://gist.github.com/schuch/e8e00d22467552bf5b9354946291f15d
>> >
>> >     My job/pipeline configuration:
>> >     https://gist.github.com/schuch/207e3a9f8de8f8481e9dbcdb69ebca5b
>> >
>> >     History screenshot attached (hope that works on mailing lists...)
>> >
>> >     My MCF Version is trunk (r1865689)
>> >
>> >     Markus
>> >
>> >
>> >     On 23.08.2019 at 01:17, Karl Wright wrote:
>> >     > Hi Markus,
>> >     >
>> >     > If you use the straight update handler, with no Tika filter, then
>> >     > the Solr Connector by design restricts input to textual documents.
>> >     > We can perhaps broaden that to web pages, but then you will be
>> >     > indexing HTML tags as well, and I rather doubt that's what you want.
>> >     >
>> >     > If you run Tika within ManifoldCF, the mime type it presents to
>> >     > the update handler is text/plain.
>> >     > If you run via the extracting update handler, then there is no
>> >     > content type check done by the Solr connector.
>> >     >
>> >     > Karl
>> >     >
>> >     >
>> >     > On Thu, Aug 22, 2019 at 5:44 PM Markus Schuch
>> >     > <markus_sch...@web.de> wrote:
>> >     >
>> >     >     Hi,
>> >     >
>> >     >     I am playing around with the SolrJ mode of the Solr output
>> >     >     connector, to avoid running Tika extraction in Solr.
>> >     >
>> >     >     My problem is that the ingestion of web pages gets rejected
>> >     >     with the message
>> >     >
>> >     >         "Solr connector rejected document due to mime type
>> >     >         restrictions: (text/html; charset=UTF-8)"
>> >     >
>> >     >     My pipeline looks like this:
>> >     >
>> >     >         1) Webcrawler Connector (Repository Connection)
>> >     >         2) Tika Extractor (Transformation)
>> >     >         3) Solr Connector (Output Connection)
>> >     >
>> >     >     The webserver returns content type "text/html; charset=UTF-8"
>> >     >     for the pages.
>> >     >
>> >     >     The "Use extracting request handler" option is disabled in the
>> >     >     Solr output connection.
>> >     >
>> >     >     The mime type inclusions in the Solr output connector are:
>> >     >
>> >     >         text/plain;charset=utf-8
>> >     >         text/html
>> >     >         text/html; charset=UTF-8
>> >     >
>> >     >     I think the ingestion gets rejected by the HttpPoster, because
>> >     >     it performs a hard check that the mime type has to be a
>> >     >     "text/plain*" type (see acceptableMimeTypes in HttpPoster).
>> >     >
>> >     >     The TikaExtractor asks whether the downstream pipeline accepts
>> >     >     "text/plain;charset=utf-8", as this is the result of the
>> >     >     extraction. But the RepositoryDocument that is sent still
>> >     >     carries the original mime type from before the extraction.
>> >     >
>> >     >     Is this a bug, or am I missing something?
>> >     >
>> >     >     Many thanks in advance
>> >     >     Markus
>> >     >
>> >
>>
>
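
The rejection described in this thread comes from a mime type check: the value "text/html; charset=UTF-8" is compared against a list that only accepts plain text, and the thread also notes that the charset parameter is supposed to be stripped before such a comparison. The following is a minimal sketch of that kind of check, not the actual ManifoldCF HttpPoster code. The class name MimeTypeFilterSketch, the methods normalize and isAccepted, and the PLAIN_TEXT_TYPES set are all hypothetical names introduced for illustration, and the single-entry accepted list is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is not ManifoldCF's HttpPoster.
class MimeTypeFilterSketch {

    // Assumed accepted list for the case where the extracting
    // request handler is disabled (plain text only).
    static final Set<String> PLAIN_TEXT_TYPES =
        new HashSet<>(Arrays.asList("text/plain"));

    // Strip parameters such as "; charset=UTF-8" and lowercase the base type.
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        int semi = mimeType.indexOf(';');
        String base = (semi >= 0) ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // True if the normalized mime type is contained in the accepted set.
    static boolean isAccepted(String mimeType, Set<String> accepted) {
        String normalized = normalize(mimeType);
        return normalized != null && accepted.contains(normalized);
    }

    public static void main(String[] args) {
        // Original mime type of a crawled web page: prints false.
        System.out.println(isAccepted("text/html; charset=UTF-8", PLAIN_TEXT_TYPES));
        // Mime type of the Tika-extracted text: prints true.
        System.out.println(isAccepted("text/plain;charset=utf-8", PLAIN_TEXT_TYPES));
    }
}
```

Under these assumptions, a document that still carries its original "text/html; charset=UTF-8" type fails the check even though its content has already been converted to plain text, which matches the behavior reported at the start of the thread.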
