Re: Analysing a document sections with Apache Tika

[email protected] Thu, 04 May 2017 08:39:06 -0700

Dear Thamme,

Thanks for your reply and the suggestions.

I built Grobid using the instructions from
http://grobid.readthedocs.io/en/latest/Install-Grobid/
and am trying to run the following example code from the GitHub repository
(https://github.com/kermitt2/grobid-example):
=================

import org.grobid.core.*;
import org.grobid.core.data.*;
import org.grobid.core.factory.*;
import org.grobid.core.mock.*;
import org.grobid.core.utilities.*;
import org.grobid.core.engines.Engine;

public class GrobidTest {

    public GrobidTest() {
        // TODO Auto-generated constructor stub
    }

    public static void main(String[] args) {
        run("D:/Eclipse-Workspace/PDFs/Train/6.pdf");
    }

    public static void run(String faFileName) {
        String pdfPath = faFileName;

        try {
            String pGrobidHome = "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home";
            String pGrobidProperties =
                "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home/config/grobid.properties";

            MockContext.setInitialContext(pGrobidHome, pGrobidProperties);
            GrobidProperties.getInstance();

            System.out.println(">>>>>>>> GROBID_HOME=" + GrobidProperties.get_GROBID_HOME_PATH());

            Engine engine = GrobidFactory.getInstance().createEngine();

            // Biblio object for the result
            BiblioItem resHeader = new BiblioItem();
            String tei = engine.processHeader(pdfPath, false, resHeader);
        }
        catch (Exception e) {
            // If an exception is generated, print a stack trace
            e.printStackTrace();
        }
        finally {
            try {
                MockContext.destroyInitialContext();
            }
            catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

}

================

I am getting the following exception:

javax.naming.NoInitialContextException: Cannot instantiate class:
org.apache.naming.java.javaURLContextFactory [Root exception is
java.lang.ClassNotFoundException:
org.apache.naming.java.javaURLContextFactory]
at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
at javax.naming.InitialContext.init(Unknown Source)
at javax.naming.InitialContext.<init>(Unknown Source)
at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:36)
at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:76)
at GrobidTest.run(GrobidTest.java:28)
at GrobidTest.main(GrobidTest.java:17)
Caused by: java.lang.ClassNotFoundException:
org.apache.naming.java.javaURLContextFactory
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
... 8 more
javax.naming.NoInitialContextException: Cannot instantiate class:
org.apache.naming.java.javaURLContextFactory [Root exception is
java.lang.ClassNotFoundException:
org.apache.naming.java.javaURLContextFactory]
at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
at javax.naming.InitialContext.init(Unknown Source)
at javax.naming.InitialContext.<init>(Unknown Source)
at org.grobid.core.mock.MockContext.destroyInitialContext(MockContext.java:105)
at GrobidTest.run(GrobidTest.java:45)
at GrobidTest.main(GrobidTest.java:17)
Caused by: java.lang.ClassNotFoundException:
org.apache.naming.java.javaURLContextFactory
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
... 7 more
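
Both traces share the same root cause: the ClassNotFoundException for org.apache.naming.java.javaURLContextFactory, a class from Apache Tomcat's org.apache.naming package, which means the JNDI context factory is not on the runtime classpath. A minimal sketch for confirming that from the same Eclipse project, assuming nothing about Grobid beyond the class name shown in the trace (ClasspathCheck and isOnClasspath are hypothetical names used only for illustration):

```java
public class ClasspathCheck {

    // Returns true if the named class can be located by this class's class loader.
    // The class is looked up without being initialized.
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name copied from the stack trace above.
        String factory = "org.apache.naming.java.javaURLContextFactory";
        System.out.println(factory + " on classpath: " + isOnClasspath(factory));
    }
}
```

If this reports false, the jar that provides the class is probably missing from the project's build path. Which jar that is depends on the Grobid version and its bundled dependencies, so it is worth checking the lib directory of the Grobid build rather than assuming a particular file name.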

On Wed, May 3, 2017 at 6:16 PM, Thamme Gowda <[email protected]> wrote:

> Hello,
>
> There is a nice project called Grobid [1] that does most of what you are
> describing.
> Tika has a Grobid parser built in (it calls Grobid over its REST API). Check out
> [2] for details.
>
> I have a project that makes use of Tika with Grobid and NER support. It
> also builds a search index using solr.
> Check out [3] for setup and [4] for parsing and indexing to Solr if
> you would like to try out my Python project.
> Here I am able to extract title, author names, affiliations, and the whole
> text of articles.
> I did not extract sections within the main body of research articles.  I
> assume there should be a way to configure it in Grobid.
>
> Alternatively, if Grobid can't detect sections, you can try the XHTML content
> handler, which preserves the basic structure of the PDF file using <p>, <br>, and
> heading tags. So technically it should be possible to write a wrapper to
> break the XHTML output from Tika into sections.
>
> To get it:
>
> # In bash, run `pip install tika` if tika isn't already installed
> import tika
> tika.initVM()
> from tika import parser
>
>
> file_path = "<pdf_dir>/2538.pdf"
> data = parser.from_file(file_path, xmlContent=True)
> print(data['content'])
>
>
>
>
> Best,
> Thamme
>
> [1] http://grobid.readthedocs.io/en/latest/Introduction/
> [2] https://wiki.apache.org/tika/GrobidJournalParser
> [3] https://github.com/USCDataScience/parser-indexer-py/tree/master/parser-server
> [4] https://github.com/USCDataScience/parser-indexer-py/blob/master/docs/parser-index-journals.md
>
> *--*
> *Thamme Gowda*
> TG | @thammegowda <https://twitter.com/thammegowda>
> ~Sent via somebody's Webmail server!
>
> On Wed, May 3, 2017 at 9:34 AM, [email protected] <[email protected]>
> wrote:
>
>> Hi,
>>
>> I am working with published research articles using Apache Tika. These
>> articles have distinct sections like abstract, introduction, literature
>> review, methodology, experimental setup, discussion and conclusions. Is
>> there some way to extract document sections with Apache Tika?
>>
>> Regards,
>>
>
>
