FYI here:
http://wiki.apache.org/tika/GrobidJournalParser

From: "tesm...@gmail.com" <tesm...@gmail.com>
Reply-To: "user@tika.apache.org" <user@tika.apache.org>
Date: Thursday, May 4, 2017 at 8:38 AM
To: "user@tika.apache.org" <user@tika.apache.org>
Cc: "thammego...@apache.org" <thammego...@apache.org>
Subject: Re: Analysing a document sections with Apache Tika

Dear Thamme,

Thanks for your reply and the suggestions.

I built Grobid using the instructions from http://grobid.readthedocs.io/en/latest/Install-Grobid/

I am trying to run the following example code from the GitHub repository (https://github.com/kermitt2/grobid-example):

=================
import org.grobid.core.*;
import org.grobid.core.data.*;
import org.grobid.core.factory.*;
import org.grobid.core.mock.*;
import org.grobid.core.utilities.*;
import org.grobid.core.engines.Engine;

public class GrobidTest {

    public GrobidTest() {
        // TODO Auto-generated constructor stub
    }

    public static void main(String[] args) {
        run("D:/Eclipse-Workspace/PDFs/Train/6.pdf");
    }

    public static void run(String faFileName) {
        String pdfPath = faFileName;
        try {
            String pGrobidHome = "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home";
            String pGrobidProperties = "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home/config/grobid.properties";

            MockContext.setInitialContext(pGrobidHome, pGrobidProperties);
            GrobidProperties.getInstance();

            System.out.println(">>>>>>>> GROBID_HOME=" + GrobidProperties.get_GROBID_HOME_PATH());

            Engine engine = GrobidFactory.getInstance().createEngine();

            // Biblio object for the result
            BiblioItem resHeader = new BiblioItem();
            String tei = engine.processHeader(pdfPath, false, resHeader);
        } catch (Exception e) {
            // If an exception is generated, print a stack trace
            e.printStackTrace();
        } finally {
            try {
                MockContext.destroyInitialContext();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
================

I am getting the following exception:

javax.naming.NoInitialContextException: Cannot instantiate class: org.apache.naming.java.javaURLContextFactory [Root exception is java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory]
    at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
    at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
    at javax.naming.InitialContext.init(Unknown Source)
    at javax.naming.InitialContext.<init>(Unknown Source)
    at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:36)
    at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:76)
    at GrobidTest.run(GrobidTest.java:28)
    at GrobidTest.main(GrobidTest.java:17)
Caused by: java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    ... 8 more
javax.naming.NoInitialContextException: Cannot instantiate class: org.apache.naming.java.javaURLContextFactory [Root exception is java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory]
    at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
    at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
    at javax.naming.InitialContext.init(Unknown Source)
    at javax.naming.InitialContext.<init>(Unknown Source)
    at org.grobid.core.mock.MockContext.destroyInitialContext(MockContext.java:105)
    at GrobidTest.run(GrobidTest.java:45)
    at GrobidTest.main(GrobidTest.java:17)
Caused by: java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    ... 7 more

On Wed, May 3, 2017 at 6:16 PM, Thamme Gowda <thammego...@apache.org> wrote:

Hello,

There is a nice project called Grobid [1] that does most of what you are describing. Tika has a Grobid parser built in (it calls Grobid over its REST API). Check out [2] for details.

I have a project that makes use of Tika with Grobid and NER support. It also builds a search index using Solr. Check out [3] for setting up and [4] for parsing and indexing to Solr if you would like to try out my Python project.

Here I am able to extract the title, author names, affiliations, and the whole text of articles. I did not extract sections within the main body of research articles. I assume there should be a way to configure it in Grobid.
Alternatively, if Grobid can't detect sections, you can try the XHTML content handler, which preserves the basic structure of the PDF file using <p>, <br>, and heading tags. So technically it should be possible to write a wrapper to break the XHTML output from Tika into sections.

To get it:

# In bash do `pip install tika` if tika isn’t already installed
import tika
tika.initVM()
from tika import parser

file_path = "<pdf_dir>/2538.pdf"
data = parser.from_file(file_path, xmlContent=True)
print(data['content'])

Best,
Thamme

[1] http://grobid.readthedocs.io/en/latest/Introduction/
[2] https://wiki.apache.org/tika/GrobidJournalParser
[3] https://github.com/USCDataScience/parser-indexer-py/tree/master/parser-server
[4] https://github.com/USCDataScience/parser-indexer-py/blob/master/docs/parser-index-journals.md

--
Thamme Gowda
TG | @thammegowda
~Sent via somebody's Webmail server!

On Wed, May 3, 2017 at 9:34 AM, tesm...@gmail.com <tesm...@gmail.com> wrote:

Hi,

I am working with published research articles using Apache Tika. These articles have distinct sections like abstract, introduction, literature review, methodology, experimental setup, discussion and conclusions.

Is there some way to extract document sections with Apache Tika?

Regards,
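
For reference, the wrapper idea from Thamme's reply (breaking Tika's XHTML output into sections at the heading tags) might look roughly like the sketch below. It uses only the Python standard library. The names `SectionSplitter` and `split_sections` are hypothetical and not part of Tika or tika-python, and the sketch assumes that section titles really do show up as <h1> to <h6> elements, which may not hold for every PDF.

```python
from html.parser import HTMLParser

# Assumption: section titles are marked up as h1..h6 in the XHTML.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionSplitter(HTMLParser):
    """Hypothetical helper: collects (heading, body_text) pairs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sections = []        # finished (heading, body) pairs
        self._heading = None      # heading of the section being built
        self._body = []           # text chunks of the section being built
        self._heading_buf = None  # a list while inside a heading tag

    def _flush(self):
        # Close the current section, collapsing runs of whitespace.
        body = " ".join(" ".join(self._body).split())
        if self._heading is not None or body:
            # Text before the first heading gets an empty heading.
            self.sections.append((self._heading or "", body))
        self._heading = None
        self._body = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._flush()
            self._heading_buf = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._heading_buf is not None:
            self._heading = " ".join("".join(self._heading_buf).split())
            self._heading_buf = None

    def handle_data(self, data):
        if self._heading_buf is not None:
            self._heading_buf.append(data)
        else:
            self._body.append(data)

    def close(self):
        super().close()
        self._flush()  # emit the last open section


def split_sections(xhtml):
    """Return a list of (heading, body_text) tuples for an XHTML string."""
    splitter = SectionSplitter()
    splitter.feed(xhtml)
    splitter.close()
    return splitter.sections


if __name__ == "__main__":
    # Stand-in for data['content'] from parser.from_file(..., xmlContent=True).
    sample = (
        "<html><body>"
        "<h1>Abstract</h1><p>We study X.</p>"
        "<h1>Introduction</h1><p>Prior work.</p><p>Our approach.</p>"
        "</body></html>"
    )
    for heading, body in split_sections(sample):
        print(heading, "->", body)
```

With the snippet from the reply, this would be called as `split_sections(data['content'])`. Any text that precedes the first heading (for example the document <title>) comes back as a section with an empty heading. If the XHTML for a given PDF contains only <p> elements, a different cue (such as short, numbered paragraphs) would be needed to find the section boundaries.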