Created a ticket: CONNECTORS-1621. Added a fix. Please let me know if it resolves the problem for you.
Thanks,
Karl

On Fri, Aug 23, 2019 at 7:33 AM Karl Wright <daddy...@gmail.com> wrote:

> Hi Markus,
>
> You are correct.
> This code was added as part of
> https://issues.apache.org/jira/browse/CONNECTORS-1482 . The code that
> was added does look at the content mime type.
>
> The reason that the mime type is not modified in the document being passed
> to Solr by Tika is because we want Solr to receive the original mime type,
> because that may be of interest at indexing time. So a filter specified in
> the Solr connector should always be against the original mime type and not
> the modified one.
>
> Let me make that change.
>
> Karl
>
>
> On Fri, Aug 23, 2019 at 6:31 AM Markus Schuch <markus_sch...@web.de>
> wrote:
>
>> I already have "update" in the handler field. One can see that in the
>> gist link I posted and it is not working.
>>
>> The HttpPoster of the SolrConnector takes
>> RepositoryDocument.getMimeType() and checks the mime type against the
>> hardcoded plain text mime type list, if Solr Cell mode (extracting
>> request handler mode) is disabled.
>>
>> I think org.apache.manifoldcf.agents.transformation.tika.TikaExtractor
>> never calling setMimeType on the duplicated RepositoryDocument to set
>> the MIME type to text/plain might be the source of my problem.
>>
>> Markus
>>
>> On 23.08.2019 at 10:30, Karl Wright wrote:
>> > There are two possible ways to configure Tika with Solr.
>> > First way: Tika extractor + Solr update handler
>> > Second way: no Tika extractor + Solr update/extract handler
>> >
>> > For the first way, the Solr Connector completely ignores any "accepted
>> > mime types" you set for it, and only accepts text/plain. For the second
>> > way, what you set in the "accepted mime types" is used to filter out
>> > what is being crawled. You NEVER include the charset, by the way, in
>> > the mime type you specify; that's supposed to get stripped off by anyone
>> > who passes it between connectors.
>> >
>> > Both of these have been extensively used by many others.
>> >
>> > So it sounds to me like what you need to do is change to the Solr
>> > update handler. That's not just unchecking the box, it is also entering
>> > "update" rather than "update/extract" in the handler field.
>> >
>> > If you still use the update/extract handler, you are essentially
>> > invoking Tika twice, which is why we don't really support this option
>> > very well. But you should be able to just have it accept "text/plain"
>> > and it should work. OR uncheck the box and it should just default to
>> > allowing "text/plain" with no other options accepted.
>> >
>> > Karl
>> >
>> >
>> > On Fri, Aug 23, 2019 at 2:17 AM Markus Schuch <markus_sch...@web.de>
>> > wrote:
>> >
>> >     Hi Karl,
>> >
>> >     What do I have to do to make Tika declare the extracted plain text
>> >     with mime type text/plain in my setup?
>> >
>> >     As I said, I have a Tika extractor in place:
>> >
>> >     Pipeline:
>> >     1) Webcrawler Connector (Repository Connection)
>> >     2) Tika Extractor (Transformation)
>> >     3) Solr Connector (Output Connection,
>> >        Extracting Update Handler disabled)
>> >
>> >     This transformer does not populate the
>> >     RepositoryDocument.setMimeType() field with the value "text/plain".
>> >     It just asks the downstream pipeline if text/plain is indexable,
>> >     but it then sends the extracted text along with the original mime
>> >     type in my setup.
>> >
>> >     My output connection:
>> >     https://gist.github.com/schuch/e8e00d22467552bf5b9354946291f15d
>> >
>> >     My job/pipeline configuration:
>> >     https://gist.github.com/schuch/207e3a9f8de8f8481e9dbcdb69ebca5b
>> >
>> >     History screenshot attached (hope that works on mailing lists...)
>> >
>> >     My MCF Version is trunk (r1865689)
>> >
>> >     Markus
>> >
>> >
>> >     On 23.08.2019 at 01:17, Karl Wright wrote:
>> > > Hi Markus,
>> > >
>> > > If you use the straight update handler, with no Tika filter, then
>> > > the Solr Connector by design restricts input to textual documents.
>> > > We can perhaps broaden that to web pages but then you will be
>> > > indexing HTML tags as well and I rather doubt that's what you want.
>> > >
>> > > If you run Tika within ManifoldCF, the mime type it presents to the
>> > > update handler is text/plain.
>> > > If you run via the extracting update handler, then there is no
>> > > content type check done by the Solr connector.
>> > >
>> > > Karl
>> > >
>> > >
>> > > On Thu, Aug 22, 2019 at 5:44 PM Markus Schuch <markus_sch...@web.de>
>> > > wrote:
>> > >
>> > >     Hi,
>> > >
>> > >     I am playing around with the SolrJ mode of the Solr output
>> > >     connector, to avoid running Tika extraction in Solr.
>> > >
>> > >     My problem is that the ingestion of web pages gets rejected
>> > >     with the message
>> > >
>> > >     "Solr connector rejected document due to mime type
>> > >     restrictions: (text/html; charset=UTF-8)"
>> > >
>> > >     My pipeline looks like this:
>> > >
>> > >     1) Webcrawler Connector (Repository Connection)
>> > >     2) Tika Extractor (Transformation)
>> > >     3) Solr Connector (Output Connection)
>> > >
>> > >     The webserver returns content type "text/html; charset=UTF-8"
>> > >     for the pages.
>> > >
>> > >     The "Use extracting request handler" option is disabled in the
>> > >     Solr output connection.
>> > >
>> > >     The mimetype inclusions in the Solr output connector are:
>> > >
>> > >     text/plain;charset=utf-8
>> > >     text/html
>> > >     text/html; charset=UTF-8
>> > >
>> > >     I think the ingestion gets rejected by the HttpPoster, because
>> > >     it performs a hard check that the mime type has to be a
>> > >     "text/plain*" type (see acceptableMimeTypes in HttpPoster).
>> > >
>> > >     The TikaExtractor asks if the downstream pipeline accepts
>> > >     "text/plain;charset=utf-8" as this is the result of the
>> > >     extraction. But the sent RepositoryDocument still carries the
>> > >     original mimetype before the extraction.
>> > >
>> > >     Is this a bug or am I missing something?
>> > >
>> > >     Many thanks in advance
>> > >     Markus
>> > >
>> >
>>
>
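
For readers following the mime type discussion in this thread, here is a minimal sketch of the two ideas that come up: stripping the charset parameter before comparing, and checking the bare type against an accepted list. It is only an illustration. The class MimeTypeCheck and its methods are hypothetical names, and this is not the code in HttpPoster or the change made for CONNECTORS-1621.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of ManifoldCF. */
public class MimeTypeCheck {

  /** Drops parameters such as "; charset=UTF-8" and lower-cases the bare type. */
  public static String stripParameters(String contentType) {
    if (contentType == null) {
      return null;
    }
    int semicolon = contentType.indexOf(';');
    String bare = (semicolon >= 0) ? contentType.substring(0, semicolon) : contentType;
    return bare.trim().toLowerCase(Locale.ROOT);
  }

  /**
   * Returns true if the bare mime type is in the accepted set.
   * Assumption: the set holds bare, lower-case types with no charset.
   */
  public static boolean isAccepted(String contentType, Set<String> acceptedTypes) {
    String bare = stripParameters(contentType);
    return bare != null && acceptedTypes.contains(bare);
  }

  public static void main(String[] args) {
    // A plain-text-only list, standing in for a hardcoded list of accepted types.
    Set<String> plainTextOnly = Set.of("text/plain");

    System.out.println(stripParameters("text/html; charset=UTF-8"));           // text/html
    System.out.println(isAccepted("text/html; charset=UTF-8", plainTextOnly)); // false
    System.out.println(isAccepted("text/plain;charset=utf-8", plainTextOnly)); // true
  }
}
```

With a check like this, an inclusion entry written as "text/html; charset=UTF-8" could never match a stripped type, which is consistent with the advice above to leave the charset out of the mime types you specify.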