Hi everyone,

I have written a standalone application that works with Solr 5.2.  I'm
using the existing JARs that come with Solr to index data off a file
system.  My application scans the file system looking for files, uses
Tika to extract the raw text, and then sends that text to Solr for
indexing using SolrJ.
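
To make the scanning step concrete, here is a minimal sketch of how files
could be collected before extraction.  It assumes Java 8+ and uses only
java.nio; FileScanner and listRegularFiles are hypothetical names for
illustration, not the application's actual code, and the Tika and SolrJ
calls are left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not the application's actual code.
class FileScanner {

    // Recursively collects every regular file under root, in sorted order.
    static List<Path> listRegularFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Scan the directory given on the command line, or the current one.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : listRegularFiles(root)) {
            // Each path would be handed to Tika here, then to SolrJ.
            System.out.println(p);
        }
    }
}
```

Directories are skipped by the filter, so only files reach the extraction
step.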

What I'm finding is that Tika will not extract the raw text from PDF,
PowerPoint, etc. files, but it will from plain text files.

Here is the code:

import java.io.File;
import java.io.FileInputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public static void parseWithTika() throws Exception {
  File file = new File("C:\\temp\\test.pdf");

  FileInputStream in = new FileInputStream(file);
  AutoDetectParser parser = new AutoDetectParser();
  Metadata metadata = new Metadata();
  BodyContentHandler contentHandler = new BodyContentHandler();

  parser.parse(in, contentHandler, metadata);

  // <=== 'content' is always an empty string
  String content = contentHandler.toString();

  in.close();
}

In the above code, 'content' is always empty (the code is taken from
https://tika.apache.org/1.8/examples.html)

Solr 5.2 comes with the following Tika JARs, all of which I have
included: tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar,
tika-xmp-1.7.jar, vorbis-java-tika-0.6.jar,
kite-morphlines-tika-core-0.12.1.jar and
kite-morphlines-tika-decompress-0.12.1.jar
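
Since the parsers in tika-parsers hand the actual work to third-party
libraries such as PDFBox and Apache POI, one check that might help is
whether those classes can be loaded at all.  The sketch below is only a
diagnostic idea, not a confirmed cause: ClasspathCheck is a hypothetical
name, and the list of class names is my assumption about which libraries
are relevant.

```java
// Hypothetical diagnostic; the candidate class names are assumptions about
// which libraries the Tika parsers need at runtime.
class ClasspathCheck {

    // Returns true if the named class can be loaded (without initializing it).
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "org.apache.tika.parser.AutoDetectParser",         // tika-core
            "org.apache.tika.parser.pdf.PDFParser",            // tika-parsers
            "org.apache.pdfbox.pdmodel.PDDocument",            // PDFBox
            "org.apache.poi.poifs.filesystem.POIFSFileSystem"  // Apache POI
        };
        for (String name : candidates) {
            System.out.println(name + " -> "
                + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Running this with the same classpath as the application would show whether
any of these classes is reported as MISSING.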

Any idea why this isn't working?

Thanks!

Steve
