Hi everyone, I have written a standalone application that works with Solr 5.2. I'm using the existing JARs that come with Solr to index data from a file system. My application scans the file system for files, uses Tika to extract the raw text from each one, and then sends that text to Solr for indexing via SolrJ.
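For context, the file-scanning step of such a pipeline could look roughly like the sketch below. This is not the poster's actual code: the class name FileScanner and the method listFiles are made up for illustration, it assumes Java 8 or later (for Files.walk), and the Tika and SolrJ steps are left as a comment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Solr or Tika.
class FileScanner {

    /** Returns every regular file under root (recursively), sorted by path. */
    static List<Path> listFiles(Path root) throws IOException {
        // Files.walk holds directory handles open, so close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : listFiles(root)) {
            // Placeholder: extract the text with Tika here, then send it to Solr via SolrJ.
            System.out.println(file);
        }
    }
}
```

Returning the list before doing any extraction keeps the scan separate from the parsing step, which makes it easier to see which of the two is misbehaving.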
What I'm finding is that Tika will not extract the raw text from PDF, PowerPoint, etc. files, but it will from plain text files. Here is the code:

    import java.io.File;
    import java.io.FileInputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public static void parseWithTika() throws Exception {
        File file = new File("C:\\temp\\test.pdf");
        FileInputStream in = new FileInputStream(file);

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        BodyContentHandler contentHandler = new BodyContentHandler();

        parser.parse(in, contentHandler, metadata);

        String content = contentHandler.toString();  // <=== 'content' is always an empty string

        in.close();
    }

In the above code, 'content' is always empty (the code is taken from https://tika.apache.org/1.8/examples.html).

Solr 5.2 comes with the following Tika JARs, all of which I have included: tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar, tika-xmp-1.7.jar, vorbis-java-tika-0.6.jar, kite-morphlines-tika-core-0.12.1.jar and kite-morphlines-tika-decompress-0.12.1.jar.

Any idea why this isn't working?

Thanks!

Steve
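One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: Tika's PDF and Office parsers rely on third-party libraries such as PDFBox and Apache POI, and the JAR list above contains only Tika-named JARs. The sketch below uses plain reflection to report which classes are visible on the classpath. The class name ParserDependencyCheck is made up for illustration, and the class names probed are assumptions about which libraries matter here.

```java
// Hypothetical diagnostic, not part of Solr or Tika.
class ParserDependencyCheck {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // 'false' means: locate the class without running its static initializers.
            Class.forName(className, false, ParserDependencyCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed probes: one class per library that might be involved.
        String[] candidates = {
            "org.apache.tika.parser.AutoDetectParser",         // tika-core
            "org.apache.tika.parser.pdf.PDFParser",            // tika-parsers
            "org.apache.pdfbox.pdmodel.PDDocument",            // PDFBox (PDF files)
            "org.apache.poi.poifs.filesystem.POIFSFileSystem", // POI (.ppt, .doc, .xls)
            "org.apache.poi.xslf.usermodel.XMLSlideShow"       // POI OOXML (.pptx)
        };
        for (String name : candidates) {
            System.out.println((isOnClasspath(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

If the PDFBox or POI lines print MISSING when this runs with the same classpath as the application, adding the remaining JARs that ship alongside the Tika JARs in the Solr distribution would be a reasonable next experiment.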