Dear Thamme,

Thanks for your reply and the suggestions.
I built Grobid using the instructions at http://grobid.readthedocs.io/en/latest/Install-Grobid/ and am trying to run the following example code from the GitHub repository (https://github.com/kermitt2/grobid-example):

=================
import org.grobid.core.*;
import org.grobid.core.data.*;
import org.grobid.core.factory.*;
import org.grobid.core.mock.*;
import org.grobid.core.utilities.*;
import org.grobid.core.engines.Engine;

public class GrobidTest {

    public GrobidTest() {
        // TODO Auto-generated constructor stub
    }

    public static void main(String[] args) {
        run("D:/Eclipse-Workspace/PDFs/Train/6.pdf");
    }

    public static void run(String faFileName) {
        String pdfPath = faFileName;
        try {
            String pGrobidHome = "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home";
            String pGrobidProperties = "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home/config/grobid.properties";

            // Register the Grobid home and properties paths through a mock JNDI context
            MockContext.setInitialContext(pGrobidHome, pGrobidProperties);
            GrobidProperties.getInstance();

            System.out.println(">>>>>>>> GROBID_HOME=" + GrobidProperties.get_GROBID_HOME_PATH());

            Engine engine = GrobidFactory.getInstance().createEngine();

            // Biblio object for the result
            BiblioItem resHeader = new BiblioItem();
            String tei = engine.processHeader(pdfPath, false, resHeader);
        } catch (Exception e) {
            // If an exception is generated, print a stack trace
            e.printStackTrace();
        } finally {
            try {
                MockContext.destroyInitialContext();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
================

I am getting the following exception:

javax.naming.NoInitialContextException: Cannot instantiate class: org.apache.naming.java.javaURLContextFactory [Root exception is java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory]
    at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
    at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
    at javax.naming.InitialContext.init(Unknown Source)
    at javax.naming.InitialContext.<init>(Unknown Source)
    at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:36)
    at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:76)
    at GrobidTest.run(GrobidTest.java:28)
    at GrobidTest.main(GrobidTest.java:17)
Caused by: java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    ... 8 more
javax.naming.NoInitialContextException: Cannot instantiate class: org.apache.naming.java.javaURLContextFactory [Root exception is java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory]
    at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
    at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
    at javax.naming.InitialContext.init(Unknown Source)
    at javax.naming.InitialContext.<init>(Unknown Source)
    at org.grobid.core.mock.MockContext.destroyInitialContext(MockContext.java:105)
    at GrobidTest.run(GrobidTest.java:45)
    at GrobidTest.main(GrobidTest.java:17)
Caused by: java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    ... 7 more

On Wed, May 3, 2017 at 6:16 PM, Thamme Gowda <thammego...@apache.org> wrote:

> Hello,
>
> There is a nice project called Grobid [1] that does most of what you are
> describing.
> Tika has a Grobid parser built in (it calls Grobid over its REST API). Check
> out [2] for details.
>
> I have a project that makes use of Tika with Grobid and NER support. It
> also builds a search index using Solr.
> Check out [3] for setting up and [4] for parsing and indexing to Solr if
> you would like to try out my Python project.
> Here I am able to extract the title, author names, affiliations, and the
> whole text of articles.
> I did not extract sections within the main body of research articles. I
> assume there should be a way to configure it in Grobid.
>
> Alternatively, if Grobid can't detect sections, you can try the XHTML content
> handler, which preserves the basic structure of the PDF file using <p>, <br>
> and heading tags. So technically it should be possible to write a wrapper to
> break the XHTML output from Tika into sections.
>
> To get it:
>
> # In bash do `pip install tika` if tika isn’t already installed
> import tika
> tika.initVM()
> from tika import parser
>
>
> file_path = "<pdf_dir>/2538.pdf"
> data = parser.from_file(file_path, xmlContent=True)
> print(data['content'])
>
>
>
>
> Best,
> Thamme
>
> [1] http://grobid.readthedocs.io/en/latest/Introduction/
> [2] https://wiki.apache.org/tika/GrobidJournalParser
> [3] https://github.com/USCDataScience/parser-indexer-py/tree/master/parser-server
> [4] https://github.com/USCDataScience/parser-indexer-py/blob/master/docs/parser-index-journals.md
>
> *--*
> *Thamme Gowda*
> TG | @thammegowda <https://twitter.com/thammegowda>
> ~Sent via somebody's Webmail server!
>
> On Wed, May 3, 2017 at 9:34 AM, tesm...@gmail.com <tesm...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am working with published research articles using Apache Tika. These
>> articles have distinct sections like abstract, introduction, literature
>> review, methodology, experimental setup, discussion and conclusions. Is
>> there some way to extract document sections with Apache Tika?
>>
>> Regards,
>>
>
>
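
For reference, here is a minimal sketch of the kind of wrapper suggested in the quoted reply, which breaks Tika's XHTML output into sections. It assumes the XHTML really contains heading tags (h1 to h6) at section boundaries, which may not hold for every PDF. The names SectionSplitter and split_sections are hypothetical and are not part of Tika or tika-python; only the Python standard library is used.

```python
from html.parser import HTMLParser

# Assumption: section boundaries show up as h1..h6 elements in the XHTML.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionSplitter(HTMLParser):
    """Hypothetical helper: starts a new section at every heading tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Each entry is [title, list_of_text_chunks]; text that appears
        # before the first heading is kept under an empty title.
        self.sections = [["", []]]
        self._in_heading = False
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        elif tag in HEADING_TAGS:
            self._in_heading = True
            self.sections.append(["", []])

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False
        elif tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._in_head:
            return  # skip whitespace and <head> content such as <title>
        if self._in_heading:
            current = self.sections[-1]
            current[0] = (current[0] + " " + text).strip()
        else:
            self.sections[-1][1].append(text)


def split_sections(xhtml):
    """Return a list of (heading, body_text) tuples in document order."""
    splitter = SectionSplitter()
    splitter.feed(xhtml)
    splitter.close()
    return [(title, " ".join(chunks))
            for title, chunks in splitter.sections
            if title or chunks]
```

With the quoted snippet, this could be called as split_sections(data['content']) after parser.from_file(file_path, xmlContent=True). If the output contains only <p> elements and no headings, everything ends up in a single section with an empty title, and some other heuristic would be needed to find the section boundaries.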