On 2/20/2021 3:58 PM, cratervoid wrote:
> SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html&literal.id=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html

The problem here is that the solrconfig.xml in use by the index named "gettingstarted" does not define a handler at /update/extract.

Typically a handler defined at that URL path uses the ExtractingRequestHandler class (also known as Solr Cell). This handler uses Tika (another Apache project) to extract usable data from rich document formats like PDF, HTML, etc. A definition for it usually looks like this:

  <!-- Solr Cell Update Request Handler

       http://wiki.apache.org/solr/ExtractingRequestHandler

    -->
  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>

Note that this handler is not on Solr's classpath by default. Using it requires adding the extraction contrib jars (Solr Cell and its Tika dependencies) to Solr.
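
For what it's worth, the sample configs that ship with Solr 8.x (for example sample_techproducts_configs) load those jars with <lib> directives near the top of solrconfig.xml, along the lines below. Treat this as a sketch rather than a drop-in fix: the relative fallback for solr.install.dir assumes the stock directory layout, so the paths may need adjusting for your install.

```xml
  <!-- Load the Tika dependencies and the Solr Cell jar (Solr 8.x layout assumed).
       solr.install.dir falls back to the relative path when the property is not set. -->
  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
```

After changing solrconfig.xml, the core or collection has to be reloaded before the new handler becomes available.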

Tika can become very unstable because it has to parse file formats that are often undocumented, so we do not recommend using this handler in production. When it runs inside Solr, a Tika crash can take Solr down with it. If the functionality is important, Tika should be run in a program that's separate from Solr, which then sends the extracted data to Solr for indexing.

Thanks,
Shawn
