Hi Karl,
Yes, this helps.
The webpage is now ingested after Tika extraction, and I only have to
include the mime type text/html in the Solr output connection.
Many thanks.
Cheers
Markus
On 23.08.2019 at 13:45, Karl Wright wrote:
Created a ticket: CONNECTORS-1621. Added a fix. Please let me know if it
resolves the problem for you.
Thanks,
Karl
On Fri, Aug 23, 2019 at 7:33 AM Karl Wright wrote:
Hi Markus,
You are correct.
This code was added as part of
https://issues.apache.org/jira/browse/CONNECTORS-1482 . The code that was
added does look at the content mime type.
The mime type is not modified in the document being passed to Solr by
Tika because we want Solr to
I already have "update" in the handler field. One can see that in the
gist link I posted, and it is not working.
The HttpPoster of the SolrConnector takes
RepositoryDocument.getMimeType() and checks the mime type against the
hardcoded plain text mime type list, if Solr Cell mode (extracting
There are two possible ways to configure Tika with Solr.
First way: Tika extractor + Solr update handler
Second way: no Tika extractor + Solr update/extract handler
For the first way, the Solr Connector completely ignores any "accepted mime
types" you set for it, and only accepts text/plain. For
Hi Karl,
What do I have to do to make Tika declare the extracted plain text with
mime type text/plain in my setup?
As I said, I have a Tika extractor in place:
Pipeline:
1) Webcrawler Connector (Repository Connection)
2) Tika Extractor (Transformation)
3) Solr Connector (Output
Hi Markus,
If you use the straight update handler, with no Tika filter, then the Solr
Connector by design restricts input to textual documents. We can perhaps
broaden that to web pages but then you will be indexing HTML tags as well
and I rather doubt that's what you want.
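The restriction described above can be pictured as a simple check of the document's mime type against a fixed list of textual types. The following is only an illustrative Python sketch, not the connector's actual Java code: the names PLAIN_TEXT_MIME_TYPES and accepted_by_plain_update_handler are hypothetical, the list contents are an assumption (the real hardcoded list may contain more entries), and stripping parameters such as "; charset=utf-8" is also an assumption.

```python
# Hypothetical stand-in for the connector's hardcoded list of textual
# mime types; the real list may contain other entries.
PLAIN_TEXT_MIME_TYPES = frozenset({"text/plain"})


def accepted_by_plain_update_handler(mime_type):
    """Return True if a document with this mime type would pass the check.

    Models only the straight update handler case described above, where
    input is restricted to textual documents.
    """
    if not mime_type:
        return False
    # Assumption: parameters such as "; charset=utf-8" are ignored.
    base_type = mime_type.split(";", 1)[0].strip().lower()
    return base_type in PLAIN_TEXT_MIME_TYPES


for candidate in ("text/plain; charset=utf-8", "text/html", "application/pdf"):
    print(candidate, "->", accepted_by_plain_update_handler(candidate))
```

Under this sketch a crawled web page that still carries text/html is rejected, which matches the symptom reported in the thread.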
If you run Tika