Yes. Solr 6.0.0 ships with Tika 1.7.  The Grobid parser came in with Tika 1.11.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Wednesday, May 4, 2016 10:29 AM
To: solr-user@lucene.apache.org
Subject: RE: Integrating grobid with Tika in solr

I think Solr is using a version of Tika that predates the addition of the 
Grobid parser.  You'll have to add that manually somehow until Solr upgrades to 
Tika 1.13 (soon to be released, I think).  See SOLR-8981.
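
One way to confirm that a given set of jars lacks the parser is to try loading the class by name, which is roughly what fails in the trace further down. The following is a minimal sketch, not part of Solr or Tika: the class name ClasspathCheck and its helper method are hypothetical, and it only inspects the classpath it is run with, which may differ from the class loaders Solr builds at startup.

```java
// Hypothetical helper: reports whether a class name resolves on the
// classpath this program was started with.
class ClasspathCheck {

    /** Returns true if the named class can be found, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but unusable because a dependency is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the parser class named in this thread.
        String name = args.length > 0
                ? args[0]
                : "org.apache.tika.parser.journal.JournalParser";
        System.out.println(name + (isOnClasspath(name) ? " is available" : " is NOT available"));
    }
}
```

Running it with the Tika jars that Solr ships on the classpath, and again with a newer Tika jar, should show whether the jar set is the difference.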

-----Original Message-----
From: Betsey Benagh [mailto:betsey.ben...@stresearch.com] 
Sent: Wednesday, May 4, 2016 10:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Integrating grobid with Tika in solr

Grobid runs as a service, and I'm (theoretically) configuring Tika to call it.

From the Grobid wiki, here are the instructions for integrating with the Tika 
application:

First we need to create the GrobidExtractor.properties file that points to the 
Grobid REST Service. My file looks like the following:

grobid.server.url=http://localhost:[port]
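
For illustration, a file in this format can be read with the standard java.util.Properties class. This is only a sketch of the file format; it does not claim to show how Tika itself reads the file. The class name GrobidPropsSketch and the method serverUrl are hypothetical, and the port placeholder is left exactly as it appears above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch: parse properties text and return the Grobid URL entry.
class GrobidPropsSketch {

    /** Returns the value of grobid.server.url, or null if the key is absent. */
    static String serverUrl(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        return props.getProperty("grobid.server.url");
    }

    public static void main(String[] args) throws IOException {
        // Same single line as the file above, placeholder port included.
        String text = "grobid.server.url=http://localhost:[port]\n";
        System.out.println(serverUrl(new StringReader(text)));
    }
}
```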

Now you can run GROBID via Tika-app with the following command on a sample PDF 
file.

java -classpath $HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar \
  org.apache.tika.cli.TikaCLI \
  --config=$HOME/src/grobidparser-resources/tika-config.xml -J \
  $HOME/src/grobid/papers/ICSE06.pdf

Here's the stack trace.

<lst name="error">
<lst name="metadata">
<str name="error-class">org.apache.solr.common.SolrException</str>
<str name="root-error-class">java.lang.ClassNotFoundException</str>
</lst>
<str name="msg">org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser</str>
<str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:82)
at org.apache.solr.core.PluginBag$LazyPluginHolder.createInst(PluginBag.java:367)
at org.apache.solr.core.PluginBag$LazyPluginHolder.get(PluginBag.java:348)
at org.apache.solr.core.PluginBag.get(PluginBag.java:148)
at org.apache.solr.handler.RequestHandlerBase.getRequestHandler(RequestHandlerBase.java:231)
at org.apache.solr.core.SolrCore.getRequestHandler(SolrCore.java:1362)
at org.apache.solr.servlet.HttpSolrCall.extractHandlerFromURLPath(HttpSolrCall.java:326)
at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:296)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:412)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:225)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:183)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:499)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:362)
at org.apache.tika.config.TikaConfig.&lt;init&gt;(TikaConfig.java:127)
at org.apache.tika.config.TikaConfig.&lt;init&gt;(TikaConfig.java:115)
at org.apache.tika.config.TikaConfig.&lt;init&gt;(TikaConfig.java:111)
at org.apache.tika.config.TikaConfig.&lt;init&gt;(TikaConfig.java:92)
at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:80)
... 30 more
Caused by: java.lang.ClassNotFoundException: org.apache.tika.parser.journal.JournalParser
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:814)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.tika.config.ServiceLoader.getServiceClass(ServiceLoader.java:189)
at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:338)
... 35 more
</str>
<int name="code">500</int>
</lst>



On 5/4/16, 10:00 AM, "Shawn Heisey" <apa...@elyograg.org> wrote:

On 5/4/2016 7:15 AM, Betsey Benagh wrote:
(X-posted from stack overflow)
This feels like a basic, dumb question, but my reading of the documentation has 
not led me to an answer.
I'm using Solr to index journal articles. Using the out-of-the-box 
configuration, it indexed the text of the documents, but I'm looking to use 
Grobid to pull out the authors, title, affiliations, etc. I got Grobid up and 
running as a service.
I added
<str name="tika.config">/path/to/tika-config.xml</str>
to the requestHandler for /update/extract in solrconfig.xml. The tika-config 
looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
   <parsers>
     <parser class="org.apache.tika.parser.journal.JournalParser">
       <mime>application/pdf</mime>
     </parser>
   </parsers>
</properties>
I'm getting a ClassNotFound exception when I try to import a document, but 
can't figure out where to set the classpath to fix it.

I do not know anything about grobid.

We'll need to see the exception -- the entire multi-line stacktrace, including 
any "caused by" sections.

In general, you should create a lib directory in the solr home and place all 
extra jars in that directory.  Otherwise you need <lib> elements in 
solrconfig.xml to load jars, and they will be loaded once for every core that 
uses that <lib> element.  Jars in ${solr.solr.home}/lib are loaded *once* when 
Solr starts and are made available to all cores.
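
Before restarting Solr, it can help to check which jars actually sit in that shared lib directory. The sketch below only lists files and does not load anything into Solr. The class name SharedLibListing and the method listJars are hypothetical, and reading the solr.solr.home system property with a fallback to the current directory is an assumption made for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: list the jar files directly under <solr home>/lib.
class SharedLibListing {

    /** Returns sorted jar file names in solrHome/lib, or an empty list if lib is missing. */
    static List<String> listJars(Path solrHome) throws IOException {
        Path lib = solrHome.resolve("lib");
        if (!Files.isDirectory(lib)) {
            return Collections.emptyList();
        }
        try (Stream<Path> entries = Files.list(lib)) {
            return entries
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(".jar"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the solr home is passed as an argument or as -Dsolr.solr.home.
        String home = args.length > 0 ? args[0] : System.getProperty("solr.solr.home", ".");
        for (String jar : listJars(Paths.get(home))) {
            System.out.println(jar);
        }
    }
}
```

If the jar that should contain the missing parser class is not in the output, Solr will not see it through the shared lib directory either.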

Thanks,
Shawn

