Re: Using Tika that comes with Solr 5.2
On Fri, 5 Feb 2016, Steven White wrote:
> I went over to Tika's home page and tried to figure out what are the JARs I need (so I don't have to use Tika's JARs that come with Solr). I looked around and couldn't find a "dist" of the JARs.

There isn't one. It's expected that you'll be using Maven or Ivy or Gradle, and add the dependency on Tika through that (full details given on the site), and the tool will take care of fetching them for you.

We provide single jars for the App and the Server, to make it easy to use those as standalone applications. They have the dependencies inlined.

You could probably grab the OSGi bundle and unpack that to get most of the jars, but using an automated build tool like Maven or Gradle would be simpler and less likely to go wrong!

Nick
Re: Using Tika that comes with Solr 5.2
Good point Uwe.

I went over to Tika's home page and tried to figure out what are the JARs I need (so I don't have to use Tika's JARs that come with Solr). I looked around and couldn't find a "dist" of the JARs. From my reading under Getting Started, it looks like I have to build Tika from source to get the JARs. Is this true or am I missing something? I would like to skip the build process and simply grab all the required JARs.

Steve
RE: Using Tika that comes with Solr 5.2
Hi,

Morphlines stuff is not needed at all. This is a MapReduce/Hadoop integration of Solr (see documentation), mostly command line tools around Solr and Hadoop.

FYI: In Solr we don’t show the warnings, because otherwise the user would get a lot of useless warnings. We may fix this when TIKA 2.0 comes with the tika-parsers module split into multiple parts. In Solr we currently removed all TIKA dependencies that conflict with the rest of Solr/Lucene or are not useful for fulltext indexing. We only left those that are useful for “fulltext” extraction (e.g. office document formats). But we have no parsers for CLASS files (breaks, because of ASM conflict) or NetCDF files (license issues in older versions of TIKA).

I see no reason to show these warnings, because if you use Solr as documented, it should work correctly. We no longer support running Solr inside foreign Application Servers. So everything should work out of the box.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
Re: Using Tika that comes with Solr 5.2
Nick, that would be a good thing to do: changing Ignore to Warn, otherwise newcomers will have no clue why this isn't working.

Another question to the team regarding this topic. I see JARs under solr\contrib\morphlines-cell\lib\ and solr\contrib\morphlines-core\lib\. Under "morphlines-cell" there are 2 files with "tika" in their name. My question is, do I need those for general Tika usage? The README.txt clearly states "*Experimental*" but doesn't say if I need them to use Tika.

Steve
RE: Using Tika that comes with Solr 5.2
On Wed, 3 Feb 2016, Uwe Schindler wrote:
> The reason for this behaviour is part of TIKA: If a parser cannot load because classes it refers to are missing, it is automatically disabled. Because you missed the actual PDF/Powerpoint/… classes, this is what happens for all those parsers.

I wonder if it might be worth SOLR changing their default Tika config from Ignore to Warn, so that SOLR users (who probably aren't as clued up on how it all works as the average Tika user) will get to find out more quickly that they've missed something?

Nick
RE: Using Tika that comes with Solr 5.2
Hi,

The reason for this behaviour is part of TIKA: If a parser cannot load because classes it refers to are missing, it is automatically disabled. Because you missed the actual PDF/Powerpoint/… classes, this is what happens for all those parsers.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
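Because a parser whose dependencies are missing is disabled without any message, it can help to check directly whether the underlying library classes are on the classpath. The following is only a minimal diagnostic sketch, not part of Tika: the class ParserDependencyCheck and its methods are hypothetical names, and the class names probed in main() are assumptions about which library classes the PDF and Office parsers rely on. It uses plain Class.forName and no Tika API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical diagnostic helper: reports whether given classes are on the classpath. */
class ParserDependencyCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ParserDependencyCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Checks each class name and returns the results in the order given. */
    static Map<String, Boolean> check(String... classNames) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : classNames) {
            result.put(name, isClassAvailable(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed representative classes; adjust to the libraries you expect to have.
        Map<String, Boolean> report = check(
                "org.apache.tika.parser.AutoDetectParser",          // tika-core
                "org.apache.pdfbox.pdmodel.PDDocument",             // PDFBox (PDF files)
                "org.apache.poi.poifs.filesystem.POIFSFileSystem"); // POI (Office files)
        for (Map.Entry<String, Boolean> e : report.entrySet()) {
            System.out.println((e.getValue() ? "found   " : "MISSING ") + e.getKey());
        }
    }
}
```

Running this with the same classpath as the indexing application shows at a glance which libraries are absent. A MISSING line for PDFBox or POI would be consistent with empty output for PDF or Powerpoint files.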
Re: Using Tika that comes with Solr 5.2
Thanks everyone. After posting about this issue, I found my issue. I was missing a whole set of Tika JARs that are found under Solr: \solr\contrib\extraction\lib\

Steve
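For anyone hitting the same problem, one way to make sure nothing in that directory is left out is to build the classpath from the directory listing instead of naming JARs one by one. This is a rough sketch using only the JDK: the class JarLister and its methods are hypothetical names, and the default directory in main() is taken from the path mentioned above, assuming the program is run from the Solr install root.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: collects every JAR in a lib directory into a classpath string. */
class JarLister {

    /** Returns all *.jar files directly inside libDir, sorted by path. */
    static List<Path> listJars(Path libDir) throws IOException {
        List<Path> jars = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar);
            }
        }
        Collections.sort(jars);
        return jars;
    }

    /** Joins the JAR paths with the platform's classpath separator (';' or ':'). */
    static String toClasspath(List<Path> jars) {
        StringBuilder sb = new StringBuilder();
        for (Path jar : jars) {
            if (sb.length() > 0) {
                sb.append(File.pathSeparatorChar);
            }
            sb.append(jar);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass another directory as the first argument.
        Path libDir = Paths.get(args.length > 0 ? args[0] : "solr/contrib/extraction/lib");
        List<Path> jars = listJars(libDir);
        System.out.println(jars.size() + " JARs found");
        System.out.println(toClasspath(jars));
    }
}
```

The printed string can be passed to java -cp together with the application's own JAR. If the directory does not exist, listJars throws an IOException rather than returning an empty list.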
Re: Using Tika that comes with Solr 5.2
On Tue, 2 Feb 2016, Steven White wrote:
> What I'm finding is that Tika will not extract the raw text off PDF, Powerpoint, etc. files but it will off raw text files.

I'd suggest you try some of the steps in the troubleshooting page: http://wiki.apache.org/tika/Troubleshooting%20Tika
Probably start at the "No Content Extracted" section, and follow the links to the possible problems + ways to check.

> Solr 5.2 comes with the following Tika JARs which I have included all of them: tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar, tika-xmp-1.7.jar, vorbis-java-tika-0.6.jar, kite-morphlines-tika-core-0.12.1.jar and kite-morphlines-tika-decompress-0.12.1.jar

You seem to be missing quite a few of the Tika dependencies, which may well be it. Follow the troubleshooting guide to check!

Nick
RE: Using Tika that comes with Solr 5.2
The problem (I think) is that tika-parsers.jar includes just the Tika parsers (wrappers) around a boatload of actual parsers/dependencies (POI, PDFBox, etc). If you are using jars, I’d recommend the tika-app.jar which includes all dependencies.

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Tuesday, February 02, 2016 7:01 PM
To: user@tika.apache.org
Subject: Using Tika that comes with Solr 5.2

Hi everyone,

I have written a standalone application that works with Solr 5.2. I'm using the existing JARs that come with Solr to index data off a file system. My application scans the file system, looking for files, then uses Tika to extract the raw text and sends the raw text to Solr, using SolrJ, for indexing.

What I'm finding is that Tika will not extract the raw text off PDF, Powerpoint, etc. files but it will off raw text files. Here is the code:

    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public static void parseWithTika() throws Exception {
        File file = new File("C:\\temp\\test.pdf");
        FileInputStream in = new FileInputStream(file);
        // AutoDetectParser picks a parser based on the detected file type
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        BodyContentHandler contentHandler = new BodyContentHandler();
        parser.parse(in, contentHandler, metadata);
        String content = contentHandler.toString(); // <=== 'content' is always an empty string
        in.close();
    }

In the above code, 'content' is always empty (the code above is adapted from https://tika.apache.org/1.8/examples.html).

Solr 5.2 comes with the following Tika JARs, all of which I have included: tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar, tika-xmp-1.7.jar, vorbis-java-tika-0.6.jar, kite-morphlines-tika-core-0.12.1.jar and kite-morphlines-tika-decompress-0.12.1.jar

Any idea why this isn't working?

Thanks!

Steve