Re: HTML sample.html not indexing in Solr 8.8

Shawn Heisey Sat, 20 Feb 2021 15:20:59 -0800

On 2/20/2021 3:58 PM, cratervoid wrote:

SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html&literal.id=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html

The problem here is that the solrconfig.xml in use by the index named "gettingstarted" does not define a handler at /update/extract.

Typically a handler defined at that URL path will utilize the extracting request handler class. This handler uses Tika (another Apache project) to extract usable data from rich text formats like PDF, HTML, etc.


  <!-- Solr Cell Update Request Handler

       http://wiki.apache.org/solr/ExtractingRequestHandler

    -->
  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>

Note that using this handler will require adding some contrib jars to Solr.
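
As a rough sketch of what loading those jars can look like, the sample configs shipped with Solr 8.x use lib directives along these lines in solrconfig.xml. The paths below assume the default install layout, with the extraction contrib under contrib/extraction/lib and the solr-cell jar under dist/; the fallback after the colon in solr.install.dir is a relative path that depends on where your core lives, so treat it as an assumption and adjust it for your setup:

```xml
  <!-- Assumed default layout: load Tika and its dependencies
       from the extraction contrib directory -->
  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <!-- Assumed default layout: load the Solr Cell jar that provides
       the extracting request handler -->
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
```

These go near the top of solrconfig.xml, inside the config element. The core or collection has to be reloaded before the handler can find the classes.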

Tika can become very unstable because it deals with undocumented file formats, so we do not recommend using that handler in production. If the functionality is important, Tika should be included in a program that's separate from Solr, so that if it crashes, it does not take Solr down with it.


Thanks,
Shawn
