Re: Analysing a document sections with Apache Tika

2017-05-04 Thread Thamme Gowda
Hi,

You CAN use Tika in Java code.
Tika is primarily written in Java, so you will have no issues using it from Java.
It may be a lot easier to use Tika with Grobid than to use
Grobid directly.

Check out which resources are added to the classpath of "Tika-App":
https://wiki.apache.org/tika/GrobidJournalParser

Check out these examples:
https://tika.apache.org/1.14/gettingstarted.html
https://tika.apache.org/1.14/examples.html


*--*
*Thamme Gowda*
TG | @thammegowda <https://twitter.com/thammegowda>
~Sent via somebody's Webmail server!

On Thu, May 4, 2017 at 10:28 AM, tesm...@gmail.com <tesm...@gmail.com>
wrote:

> Hi,
>
> Thanks for sharing the link.
>
> I need to integrate this feature into my Java code.
>
>
> Regards,

Re: Analysing a document sections with Apache Tika

2017-05-04 Thread Chris Mattmann
FYI here:

http://wiki.apache.org/tika/GrobidJournalParser

Re: Analysing a document sections with Apache Tika

2017-05-04 Thread tesm...@gmail.com
Dear Thamme,

Thanks for your reply and the suggestions.

I built Grobid using the instructions from
http://grobid.readthedocs.io/en/latest/Install-Grobid/
and am trying to run the following example code from the GitHub repository
(https://github.com/kermitt2/grobid-example):

import org.grobid.core.*;
import org.grobid.core.data.*;
import org.grobid.core.factory.*;
import org.grobid.core.mock.*;
import org.grobid.core.utilities.*;
import org.grobid.core.engines.Engine;

public class GrobidTest {

    public GrobidTest() {
        // TODO Auto-generated constructor stub
    }

    public static void main(String[] args) {
        run("D:/Eclipse-Workspace/PDFs/Train/6.pdf");
    }

    public static void run(String faFileName) {
        String pdfPath = faFileName;

        try {
            String pGrobidHome = "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home";
            String pGrobidProperties =
                "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home/config/grobid.properties";

            MockContext.setInitialContext(pGrobidHome, pGrobidProperties);
            GrobidProperties.getInstance();

            System.out.println(">>>>>>>> GROBID_HOME=" + GrobidProperties.get_GROBID_HOME_PATH());

            Engine engine = GrobidFactory.getInstance().createEngine();

            // Biblio object for the result
            BiblioItem resHeader = new BiblioItem();
            String tei = engine.processHeader(pdfPath, false, resHeader);
        }
        catch (Exception e) {
            // If an exception is generated, print a stack trace
            e.printStackTrace();
        }
        finally {
            try {
                MockContext.destroyInitialContext();
            }
            catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

}



Getting the following exception:

javax.naming.NoInitialContextException: Cannot instantiate class: org.apache.naming.java.javaURLContextFactory [Root exception is java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory]
    at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
    at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
    at javax.naming.InitialContext.init(Unknown Source)
    at javax.naming.InitialContext.<init>(Unknown Source)
    at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:36)
    at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:76)
    at GrobidTest.run(GrobidTest.java:28)
    at GrobidTest.main(GrobidTest.java:17)
Caused by: java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    ... 8 more
javax.naming.NoInitialContextException: Cannot instantiate class: org.apache.naming.java.javaURLContextFactory [Root exception is java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory]
    at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
    at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
    at javax.naming.InitialContext.init(Unknown Source)
    at javax.naming.InitialContext.<init>(Unknown Source)
    at org.grobid.core.mock.MockContext.destroyInitialContext(MockContext.java:105)
    at GrobidTest.run(GrobidTest.java:45)
    at GrobidTest.main(GrobidTest.java:17)
Caused by: java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
    ... 7 more





Re: Analysing a document sections with Apache Tika

2017-05-03 Thread Thamme Gowda
Hello,

There is a nice project called Grobid [1] that does most of what you are
describing.
Tika has a Grobid parser built in (it calls Grobid over its REST API). Check out
[2] for details.

I have a project that makes use of Tika with Grobid and NER support. It
also builds a search index using Solr.
Check out [3] for setting up and [4] for parsing and indexing to Solr if
you'd like to try out my Python project.
Here I am able to extract title, author names, affiliations, and the whole
text of articles.
I did not extract sections within the main body of research articles.  I
assume there should be a way to configure it in Grobid.

Alternatively, if Grobid can't detect sections, you can try the XHTML content
handler, which preserves the basic structure of the PDF file using <p> and
heading tags. So technically it should be possible to write a wrapper that
breaks the XHTML output from Tika into sections.

To get it:

# In bash, run `pip install tika` if tika isn't already installed
import tika
tika.initVM()
from tika import parser


file_path = "/2538.pdf"
data = parser.from_file(file_path, xmlContent=True)
print(data['content'])
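
The wrapper idea mentioned above could be sketched roughly like this. It is only a minimal sketch, not something Tika provides: it assumes the XHTML marks section titles with h1 to h6 tags (whether heading tags show up for a given PDF depends on the document), and the names SectionSplitter and split_sections are hypothetical. It uses only html.parser from the Python standard library.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionSplitter(HTMLParser):
    """Hypothetical helper: collects (heading, text) pairs from XHTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sections = []         # finished (heading, text) pairs
        self._heading = None       # None means "text before the first heading"
        self._body = []            # text chunks of the current section
        self._in_heading = False
        self._heading_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._flush()          # a new heading closes the previous section
            self._in_heading = True
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            self._heading = " ".join(" ".join(self._heading_parts).split())

    def handle_data(self, data):
        if self._in_heading:
            self._heading_parts.append(data)
        else:
            self._body.append(data)

    def _flush(self):
        # Join chunks with spaces and normalise whitespace.
        text = " ".join(" ".join(self._body).split())
        if self._heading is not None or text:
            self.sections.append((self._heading, text))
        self._heading = None
        self._body = []

    def close(self):
        super().close()
        self._flush()              # emit the last open section


def split_sections(xhtml):
    """Return a list of (heading, text) pairs found in the XHTML string."""
    splitter = SectionSplitter()
    splitter.feed(xhtml)
    splitter.close()
    return splitter.sections


# Made-up sample input, standing in for data['content'].
sample = (
    "<html><body>"
    "<h1>Abstract</h1><p>We study X.</p>"
    "<h1>Introduction</h1><p>Prior work</p><p>shows Y.</p>"
    "</body></html>"
)
for heading, text in split_sections(sample):
    print(heading, "->", text)
# Abstract -> We study X.
# Introduction -> Prior work shows Y.
```

With the snippet above, split_sections(data['content']) would give one pair per heading. Any text that comes before the first heading (including the document title in the XHTML head) is returned with a heading of None.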




Best,
Thamme

[1] http://grobid.readthedocs.io/en/latest/Introduction/
[2] https://wiki.apache.org/tika/GrobidJournalParser
[3]
https://github.com/USCDataScience/parser-indexer-py/tree/master/parser-server
[4]
https://github.com/USCDataScience/parser-indexer-py/blob/master/docs/parser-index-journals.md



On Wed, May 3, 2017 at 9:34 AM, tesm...@gmail.com  wrote:

> Hi,
>
> I am working with published research articles using Apache Tika. These
> articles have distinct sections like abstract, introduction, literature
> review, methodology, experimental setup, discussion and conclusions. Is
> there some way to extract document sections with Apache Tika?
>
> Regards,
>