Y. Solr 6.0.0 is shipping with Tika 1.7. Grobid came in with Tika 1.11.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, May 4, 2016 10:29 AM
To: solr-user@lucene.apache.org
Subject: RE: Integrating grobid with Tika in solr
I think Solr is using a version of Tika that predates the addition of the Grobid parser. You'll have to add that manually somehow until Solr upgrades to Tika 1.13 (soon to be released...I think). SOLR-8981.

-----Original Message-----
From: Betsey Benagh [mailto:betsey.ben...@stresearch.com]
Sent: Wednesday, May 4, 2016 10:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Integrating grobid with Tika in solr

Grobid runs as a service, and I'm (theoretically) configuring Tika to call it.

From the Grobid wiki, here are instructions for integrating with the Tika application:

First we need to create the GrobidExtractor.properties file that points to the Grobid REST Service. My file looks like the following:

    grobid.server.url=http://localhost:[port]

Now you can run GROBID via Tika-app with the following command on a sample PDF file.

    java -classpath $HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar org.apache.tika.cli.TikaCLI --config=$HOME/src/grobidparser-resources/tika-config.xml -J $HOME/src/grobid/papers/ICSE06.pdf

Here's the stack trace.
<lst name="error">
  <lst name="metadata">
    <str name="error-class">org.apache.solr.common.SolrException</str>
    <str name="root-error-class">java.lang.ClassNotFoundException</str>
  </lst>
  <str name="msg">org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser</str>
  <str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
    at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:82)
    at org.apache.solr.core.PluginBag$LazyPluginHolder.createInst(PluginBag.java:367)
    at org.apache.solr.core.PluginBag$LazyPluginHolder.get(PluginBag.java:348)
    at org.apache.solr.core.PluginBag.get(PluginBag.java:148)
    at org.apache.solr.handler.RequestHandlerBase.getRequestHandler(RequestHandlerBase.java:231)
    at org.apache.solr.core.SolrCore.getRequestHandler(SolrCore.java:1362)
    at org.apache.solr.servlet.HttpSolrCall.extractHandlerFromURLPath(HttpSolrCall.java:326)
    at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:296)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:412)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:225)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:183)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
    at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:362)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:127)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:115)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:111)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:92)
    at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:80)
    ... 30 more
Caused by: java.lang.ClassNotFoundException: org.apache.tika.parser.journal.JournalParser
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:814)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.tika.config.ServiceLoader.getServiceClass(ServiceLoader.java:189)
    at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:338)
    ... 35 more
</str>
  <int name="code">500</int>
</lst>

On 5/4/16, 10:00 AM, "Shawn Heisey" <apa...@elyograg.org> wrote:

On 5/4/2016 7:15 AM, Betsey Benagh wrote:

(X-posted from Stack Overflow)

This feels like a basic, dumb question, but my reading of the documentation has not led me to an answer.

I'm using Solr to index journal articles. Using the out-of-the-box configuration, it indexed the text of the documents, but I'm looking to use Grobid to pull out the authors, title, affiliations, etc. I got grobid up and running as a service.

I added

    <str name="tika.config">/path/to/tika-config.xml</str>

to the requestHandler for /update/extract in solrconfig.xml.

The tika-config looks like:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <properties>
      <parsers>
        <parser class="org.apache.tika.parser.journal.JournalParser">
          <mime>application/pdf</mime>
        </parser>
      </parsers>
    </properties>

I'm getting a ClassNotFound exception when I try to import a document, but can't figure out where to set the classpath to fix it.

I do not know anything about grobid. We'll need to see the exception -- the entire multi-line stacktrace, including any "caused by" sections.

In general, you should create a lib directory in the solr home and place all extra jars in that directory.
Otherwise you need <lib> elements in solrconfig.xml to load jars -- and they will be loaded once for every core that uses that <lib> element. ${solr.solr.home}/lib loads jars *once* when Solr starts and makes them available to all cores.

Thanks,
Shawn
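
For anyone who lands on this thread with the same ClassNotFoundException: before changing any configuration, it can help to check whether any jar in the directory Solr loads from actually contains the class named in the trace. The following is only a rough Python sketch of that check, not something that ships with Solr, Tika, or Grobid. The helper names `class_entry` and `jars_containing` and the example path in the comment are made up for illustration, and the script only looks at top-level *.jar files in one directory.

```python
import zipfile
from pathlib import Path

# Class name taken from the ClassNotFoundException in the stack trace.
MISSING_CLASS = "org.apache.tika.parser.journal.JournalParser"


def class_entry(class_name):
    """Convert a dotted Java class name to its entry path inside a jar."""
    return class_name.replace(".", "/") + ".class"


def jars_containing(lib_dir, class_name):
    """Return the sorted file names of jars in lib_dir that contain class_name.

    Hypothetical helper: it scans only *.jar files directly inside lib_dir
    and skips files that are not readable zip archives.
    """
    entry = class_entry(class_name)
    hits = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Not a valid jar/zip file; ignore it.
            continue
    return hits


# Example call (the path is an assumption; substitute your own lib directory):
# print(jars_containing("/var/solr/data/lib", MISSING_CLASS))
```

An empty result means no jar in that directory provides the class, which would be consistent with the point above that the Tika version bundled with Solr predates the Grobid parser. A non-empty result says only that the class is present in that jar; it does not show that the jar is compatible with the Tika version Solr ships.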