subject:"Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode"

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Markus Schuch

Hi Karl, yes, this helps. The webpage is now ingested after Tika extraction, and I only have to include the mime type text/html in the Solr output connection. Many thanks. Cheers, Markus On 23.08.2019 at 13:45, Karl Wright wrote: > Created a ticket: CONNECTORS-1621. Added a fix. Please let

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Karl Wright

Created a ticket: CONNECTORS-1621. Added a fix. Please let me know if it resolves the problem for you. Thanks, Karl On Fri, Aug 23, 2019 at 7:33 AM Karl Wright wrote: > Hi Markus, > > You are correct. > This code was added as part of > https://issues.apache.org/jira/browse/CONNECTORS-1482 .

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Karl Wright

Hi Markus, You are correct. This code was added as part of https://issues.apache.org/jira/browse/CONNECTORS-1482 . The code that was added does look at the content mime type. The reason that the mime type is not modified in the document being passed to Solr by Tika is because we want Solr to

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Markus Schuch

I already have "update" in the handler field. One can see that in the gist link I posted, and it is not working. The HttpPoster of the SolrConnector takes RepositoryDocument.getMimeType() and checks the mime type against the hardcoded plain-text mime type list, if Solr Cell mode (extracting

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Karl Wright

There are two possible ways to configure Tika with Solr.
First way: Tika extractor + Solr update handler
Second way: no Tika extractor + Solr update/extract handler
For the first way, the Solr Connector completely ignores any "accepted mime types" you set for it, and only accepts text/plain. For
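The check described in this message and in Markus's reply can be sketched roughly as follows. This is an illustration only, not the ManifoldCF source: the class name, method name, and the contents of the plain-text list are hypothetical, and the behaviour of the extract-handler branch is an assumption, since the snippet is cut off before it describes the second way.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the mime type gate discussed in the thread; not the actual HttpPoster code. */
class MimeTypeGateSketch {

    // Assumed stand-in for the hardcoded plain-text list mentioned by Markus.
    private static final Set<String> PLAIN_TEXT_TYPES = Set.of("text/plain");

    /**
     * @param useExtractHandler true for the update/extract handler (second way),
     *                          false for the plain update handler (first way)
     * @param mimeType          the mime type reported for the document
     * @param configuredTypes   the "accepted mime types" set on the output connection
     */
    static boolean accepts(boolean useExtractHandler, String mimeType, Set<String> configuredTypes) {
        if (mimeType == null) {
            return false; // assumption: documents without a mime type are rejected
        }
        // Drop parameters such as "; charset=utf-8" before comparing.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (useExtractHandler) {
            // Assumption: with the extract handler the configured list is consulted.
            return configuredTypes.contains(base);
        }
        // First way, as described in the thread: the configured list is ignored
        // and only the hardcoded plain-text types pass.
        return PLAIN_TEXT_TYPES.contains(base);
    }

    public static void main(String[] args) {
        Set<String> configured = Set.of("text/html");
        // A web page keeps its original mime type even after Tika extraction,
        // so it is rejected on the plain update handler.
        System.out.println(accepts(false, "text/html; charset=utf-8", configured));
        System.out.println(accepts(false, "text/plain", configured));
    }
}
```

Under these assumptions a crawled web page that still reports text/html is rejected on the plain update handler regardless of the configured list, which matches the symptom reported in this thread.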

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Markus Schuch

Hi Karl, what do I have to do to make Tika declare the extracted plain text with mime type text/plain in my setup? As I said, I have a Tika extractor in place:
Pipeline:
1) Webcrawler Connector (Repository Connection)
2) Tika Extractor (Transformation)
3) Solr Connector (Output

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-22 Thread Karl Wright

Hi Markus, If you use the straight update handler, with no Tika filter, then the Solr Connector by design restricts input to textual documents. We can perhaps broaden that to web pages but then you will be indexing HTML tags as well and I rather doubt that's what you want. If you run Tika

7 matches
