Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Markus Schuch
Hi Karl,

Yes, this helps.

The web page is now ingested after Tika extraction, and I only have to
include the mime type text/html in the Solr output connection.

Many thanks.

Cheers
Markus


Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Karl Wright
Created a ticket: CONNECTORS-1621.  Added a fix.  Please let me know if it
resolves the problem for you.

Thanks,
Karl


Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Karl Wright
Hi Markus,

You are correct.
This code was added as part of
https://issues.apache.org/jira/browse/CONNECTORS-1482 .  The code that was
added does look at the content mime type.

Tika does not modify the mime type of the document being passed to Solr
because we want Solr to receive the original mime type, which may be of
interest at indexing time.  So a filter specified in the Solr connector
should always be applied against the original mime type and not the
modified one.

Let me make that change.
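
To illustrate the rule described above (filter on the original mime type,
not on text/plain), here is a minimal sketch. It is not the actual
CONNECTORS-1621 patch; the class name, method name, and the assumption that
the configured inclusions are stored lowercase without parameters are all
hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: hypothetical names, not ManifoldCF code.
final class OriginalMimeTypeFilterSketch {

    // Returns true when the document's ORIGINAL content type (as delivered by
    // the repository connector) matches one of the configured inclusions.
    // Assumption: inclusions are lowercase and carry no parameters.
    static boolean passesFilter(String originalContentType, Set<String> includedMimeTypes) {
        if (originalContentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semi = originalContentType.indexOf(';');
        String base = semi >= 0 ? originalContentType.substring(0, semi) : originalContentType;
        return includedMimeTypes.contains(base.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> included = new HashSet<>(Arrays.asList("text/html"));
        // The web page from this thread passes once text/html is included.
        System.out.println(passesFilter("text/html; charset=UTF-8", included));
        System.out.println(passesFilter("application/pdf", included));
    }
}
```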

Karl


Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Markus Schuch
I already have "update" in the handler field. You can see that in the
gist link I posted, and it is not working.

When Solr Cell mode (extracting request handler mode) is disabled, the
HttpPoster of the SolrConnector takes RepositoryDocument.getMimeType()
and checks it against the hardcoded plain text mime type list.

I think the source of my problem might be that
org.apache.manifoldcf.agents.transformation.tika.TikaExtractor never calls
setMimeType on the duplicated RepositoryDocument to set the MIME type to
text/plain.
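
The behaviour described above can be reproduced with a small stand-alone
sketch. This is not the real HttpPoster code: the class and method names are
hypothetical, and the "starts with text/plain" test is an assumption based
on the description in this thread. It only models the case where the
extracting request handler is disabled.

```java
import java.util.Locale;

// Illustrative only: hypothetical names, not the ManifoldCF HttpPoster.
final class PlainTextMimeCheckSketch {

    // Models a hard check that only lets "text/plain*" types through when
    // the extracting request handler is disabled.
    static boolean isAccepted(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        return mimeType.trim().toLowerCase(Locale.ROOT).startsWith("text/plain");
    }

    public static void main(String[] args) {
        // The document still carries the web server's original type, so it
        // is rejected even though Tika has already extracted plain text.
        System.out.println(isAccepted("text/html; charset=UTF-8"));
        // This is the type the check would accept.
        System.out.println(isAccepted("text/plain;charset=utf-8"));
    }
}
```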

Markus


Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Karl Wright
There are two possible ways to configure Tika with Solr.
First way: Tika extractor + Solr update handler
Second way: no Tika extractor + Solr update/extract handler

For the first way, the Solr Connector completely ignores any "accepted mime
types" you set for it, and only accepts text/plain.  For the second way,
what you set in the "accepted mime types" is used to filter out what is
being crawled.  You NEVER include the charset, by the way, in the mime type
you specify; that's supposed to get stripped off by anyone who passes it
between connectors.

Both of these have been extensively used by many others.
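
On the charset point: stripping the parameters from a Content-Type value
before comparing or configuring mime types could look roughly like the
sketch below. The class and method names are hypothetical and this is not
taken from the ManifoldCF code base.

```java
import java.util.Locale;

// Illustrative only: hypothetical helper, not ManifoldCF code.
final class MimeTypeNormalizerSketch {

    // Returns the bare mime type, e.g. "text/html" for
    // "text/html; charset=UTF-8". Everything from the first ';' on is dropped.
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(baseType("text/html; charset=UTF-8")); // prints "text/html"
        System.out.println(baseType("text/plain;charset=utf-8")); // prints "text/plain"
    }
}
```

With this normalization, the accepted mime types list would contain only
entries such as text/html or text/plain, never a charset suffix.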

So it sounds to me like what you need to do is change to the Solr update
handler.  That's not just unchecking the box, it is also entering "update"
rather than "update/extract" in the handler field.

If you still use the update/extract handler, you are essentially invoking
Tika twice, which is why we don't really support this option very well.
But you should be able to just have it accept "text/plain" and it should
work.  OR uncheck the box and it should just default to allowing
"text/plain" with no other options accepted.

Karl


Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Markus Schuch
Hi Karl,

What do I have to do to make Tika declare the extracted plain text with
mime type text/plain in my setup?

As I said, I have a Tika extractor in place:

Pipeline:
1) Webcrawler Connector (Repository Connection)
2) Tika Extractor (Transformation)
3) Solr Connector (Output Connection,
   Extracting Update Handler disabled)

This transformer does not call RepositoryDocument.setMimeType() with the
value "text/plain". It just asks the downstream pipeline if text/plain is
indexable, but it then sends the extracted text along with the original
mime type in my setup.

My output connection:
https://gist.github.com/schuch/e8e00d22467552bf5b9354946291f15d

My job/pipeline configuration:
https://gist.github.com/schuch/207e3a9f8de8f8481e9dbcdb69ebca5b

History screenshot attached (hope that works on mailing lists...)

My MCF Version is trunk (r1865689)

Markus


Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-22 Thread Karl Wright
Hi Markus,

If you use the straight update handler, with no Tika filter, then the Solr
Connector by design restricts input to textual documents.  We can perhaps
broaden that to web pages but then you will be indexing HTML tags as well
and I rather doubt that's what you want.

If you run Tika within ManifoldCF, the mime type it presents to the update
handler is text/plain.
If you run via the extracting update handler, then there is no content type
check done by the Solr connector.

Karl


On Thu, Aug 22, 2019 at 5:44 PM Markus Schuch wrote:

> Hi,
>
> i am playing around with the solrj mode of the solr output connector, to
> avoid running tika extraction in solr.
>
> My problem is, that the ingestion of web pages gets rejected with the
> message
>
> "Solr connector rejected document due to mime type restrictions:
> (text/html; charset=UTF-8)"
>
> My pipeline looks like this:
>
> 1) Webcrawler Connector (Repository Connection)
> 2) Tika Extractor (Transformation)
> 3) Solr Connector (Output Connection)
>
> The webserver returns content type "text/html; charset=UTF-8" for the
> pages.
>
> The "Use extracting request handler" option is disabled in the solr
> output connection.
>
> The mimetype inclusions in the solr output connector are:
>
> text/plain;charset=utf-8
> text/html
> text/html; charset=UTF-8
>
> I think the ingestion gets rejected by the HttpPoster, because it
> performs a hard check that the mime type has to be a "text/plain*" type
> (see acceptableMimeTypes in HttpPoster).
>
> The TikaExtractor asks if downstream pipeline accepts
> "text/plain;charset=utf-8" as this is the result of the extraction. But
> the sent RepositoryDocument still carries the original mimetype before
> the extraction.
>
> Is this a bug or am i missing something?
>
> Many thanks in advance
> Markus
>