Re: Analysing a document sections with Apache Tika

Chris Mattmann Thu, 04 May 2017 08:47:48 -0700

FYI here:


http://wiki.apache.org/tika/GrobidJournalParser 

 

 

 

From: "[email protected]" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, May 4, 2017 at 8:38 AM
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Analysing a document sections with Apache Tika

 

Dear Thamme, 

 

Thanks for your reply and the suggestions.

 

I built Grobid using the instructions from 
http://grobid.readthedocs.io/en/latest/Install-Grobid/

I am trying to run the following example code from the GitHub 
repository (https://github.com/kermitt2/grobid-example):

=================

 

import org.grobid.core.*;
import org.grobid.core.data.*;
import org.grobid.core.factory.*;
import org.grobid.core.mock.*;
import org.grobid.core.utilities.*;
import org.grobid.core.engines.Engine;

public class GrobidTest {

    public GrobidTest() {
        // TODO Auto-generated constructor stub
    }

    public static void main(String[] args) {
        run("D:/Eclipse-Workspace/PDFs/Train/6.pdf");
    }

    public static void run(String faFileName) {
        String pdfPath = faFileName;

        try {
            String pGrobidHome = "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home";
            String pGrobidProperties =
                "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home/config/grobid.properties";

            MockContext.setInitialContext(pGrobidHome, pGrobidProperties);
            GrobidProperties.getInstance();

            System.out.println(">>>>>>>> GROBID_HOME=" + GrobidProperties.get_GROBID_HOME_PATH());

            Engine engine = GrobidFactory.getInstance().createEngine();

            // Biblio object for the result
            BiblioItem resHeader = new BiblioItem();
            String tei = engine.processHeader(pdfPath, false, resHeader);
        } catch (Exception e) {
            // If an exception is generated, print a stack trace
            e.printStackTrace();
        } finally {
            try {
                MockContext.destroyInitialContext();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

 

================

 

I am getting the following exception:

 

javax.naming.NoInitialContextException: Cannot instantiate class: org.apache.naming.java.javaURLContextFactory [Root exception is java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory]
    at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
    at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
    at javax.naming.InitialContext.init(Unknown Source)
    at javax.naming.InitialContext.<init>(Unknown Source)
    at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:36)
    at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:76)
    at GrobidTest.run(GrobidTest.java:28)
    at GrobidTest.main(GrobidTest.java:17)
Caused by: java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    ... 8 more

javax.naming.NoInitialContextException: Cannot instantiate class: org.apache.naming.java.javaURLContextFactory [Root exception is java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory]
    at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
    at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
    at javax.naming.InitialContext.init(Unknown Source)
    at javax.naming.InitialContext.<init>(Unknown Source)
    at org.grobid.core.mock.MockContext.destroyInitialContext(MockContext.java:105)
    at GrobidTest.run(GrobidTest.java:45)
    at GrobidTest.main(GrobidTest.java:17)
Caused by: java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    ... 7 more

 

 

 

 

On Wed, May 3, 2017 at 6:16 PM, Thamme Gowda <[email protected]> wrote:

Hello, 

 

There is a nice project called Grobid [1] that does most of what you are 
describing.

Tika has a Grobid parser built in (it calls Grobid over its REST API). Check 
out [2] for details.

 

I have a project that uses Tika with Grobid and NER support. It also 
builds a search index using Solr. 

Check out [3] for setup and [4] for parsing and indexing to Solr if you 
would like to try my Python project.

Here I am able to extract title, author names, affiliations, and the whole text 
of articles. 

I did not extract sections within the main body of research articles.  I assume 
there should be a way to configure it in Grobid.
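
As a rough illustration: as far as I am aware, Grobid's full-text TEI output wraps each body section in a <div> whose <head> holds the section title. Assuming that layout, a minimal Python sketch for pulling sections out of a TEI string could look like this. The function name tei_sections and the hand-written SAMPLE_TEI are hypothetical; real Grobid output is richer and should be checked before relying on this.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI documents (assumed to apply to Grobid's output).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_sections(tei_xml):
    """Return a list of (heading, text) pairs, one per <div> in the TEI body."""
    root = ET.fromstring(tei_xml)
    sections = []
    for div in root.findall(".//tei:text/tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        heading = "".join(head.itertext()).strip() if head is not None else ""
        # Normalise whitespace inside each paragraph, one paragraph per line.
        paragraphs = [
            " ".join("".join(p.itertext()).split())
            for p in div.findall("tei:p", TEI_NS)
        ]
        sections.append((heading, "\n".join(paragraphs)))
    return sections


# Hand-written sample for illustration only, not real Grobid output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Introduction</head><p>First paragraph.</p><p>Second one.</p></div>
    <div><head>Methodology</head><p>How it was done.</p></div>
  </body></text>
</TEI>"""

if __name__ == "__main__":
    for heading, text in tei_sections(SAMPLE_TEI):
        print(heading, "->", text)
```

Section titles in real articles vary a lot, so mapping them to labels such as "methodology" or "discussion" would still need a separate matching step.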

 

Alternatively, if Grobid can't detect sections, you can try the XHTML content 
handler, which preserves the basic structure of the PDF file using <p>, <br> and 
heading tags. So technically it should be possible to write a wrapper that 
breaks the XHTML output from Tika into sections.

 

To get the XHTML output:

# In bash, run `pip install tika` if tika isn't already installed
import tika
tika.initVM()
from tika import parser

file_path = "<pdf_dir>/2538.pdf"
data = parser.from_file(file_path, xmlContent=True)
print(data['content'])
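
The wrapper idea can be sketched as follows, using only the Python standard library. This is a minimal sketch that assumes the XHTML actually contains <h1> to <h6> tags; for many PDFs the output may consist mostly of <p> elements, in which case a heuristic (font size, numbering, known section names) would be needed instead. The names SectionSplitter, split_sections and SAMPLE_XHTML are hypothetical and not part of Tika.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionSplitter(HTMLParser):
    """Collect body text into sections, starting a new one at each heading tag."""

    def __init__(self):
        super().__init__()
        # The first entry holds any text that appears before the first heading.
        self.sections = [{"heading": "", "text": []}]
        self._in_heading = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in HEADING_TAGS:
            self._in_heading = True
            self.sections.append({"heading": "", "text": []})

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        data = " ".join(data.split())  # collapse whitespace
        if not data or not self._in_body:
            return  # skip blank runs and anything in <head>, e.g. <title>
        current = self.sections[-1]
        if self._in_heading:
            current["heading"] = (current["heading"] + " " + data).strip()
        else:
            current["text"].append(data)


def split_sections(xhtml):
    """Return (heading, text) pairs; empty sections are dropped."""
    splitter = SectionSplitter()
    splitter.feed(xhtml)
    splitter.close()
    return [
        (s["heading"], " ".join(s["text"]))
        for s in splitter.sections
        if s["heading"] or s["text"]
    ]


# Hand-written sample for illustration only, not real Tika output.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>x</title></head>
<body><h1>Abstract</h1><p>Short summary.</p>
<h1>Introduction</h1><p>Background text.</p><p>More text.</p></body></html>"""

if __name__ == "__main__":
    for heading, text in split_sections(SAMPLE_XHTML):
        print(heading, "->", text)
```

With the parser call above, this would be used as split_sections(data['content']).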

 

 

 

Best,

Thamme

 

[1] http://grobid.readthedocs.io/en/latest/Introduction/

[2] https://wiki.apache.org/tika/GrobidJournalParser

[3] 
https://github.com/USCDataScience/parser-indexer-py/tree/master/parser-server

[4] 
https://github.com/USCDataScience/parser-indexer-py/blob/master/docs/parser-index-journals.md
 


--

Thamme Gowda

TG | @thammegowda 

~Sent via somebody's Webmail server!

 

On Wed, May 3, 2017 at 9:34 AM, [email protected] <[email protected]> wrote:

Hi, 

 

I am working with published research articles using Apache Tika. These articles 
have distinct sections like abstract, introduction, literature review, 
methodology, experimental setup, discussion and conclusions. Is there some way 
to extract document sections with Apache Tika?

 

Regards,
