[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-19 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703408#comment-14703408
 ] 

Chris A. Mattmann commented on TIKA-1699:
-

Agreed. We have suggested it in 
[#59|http://github.com/kermit2/grobid/issues/59]. Please feel free to join the 
convo there.

> Integrate the GROBID PDF extractor in Tika
> --
>
> Key: TIKA-1699
> URL: https://issues.apache.org/jira/browse/TIKA-1699
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
> Attachments: TIKA-1699.grobid-core.MattmannShah.081515.patch.txt, 
> TIKA-1699.restgrobid.MattmannWIP081515.patch.txt
>
>
> GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning 
> library for extracting, parsing and re-structuring raw documents such as PDF 
> into structured TEI-encoded documents with a particular focus on technical 
> and scientific publications.
> It has a java api which can be used to augment PDF parsing for journals and 
> help extract extra metadata about the paper like authors, publication, 
> citations, etc. 
> It would be nice to have this integrated into Tika, I have tried it on my 
> local, will issue a pull request soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-19 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703387#comment-14703387
 ] 

Nick Burch commented on TIKA-1699:
--

Quick one - the wiki mentions needing to do a 600 MB git checkout and then a 
build. Is it possible to just download a smaller pre-built package of GROBID to 
skip this step? And if not, could we maybe suggest it to them for their next 
release? (A download of 10s of MB is probably easier and more beginner-friendly 
than a huge checkout + having to build!)



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699953#comment-14699953
 ] 

Hudson commented on TIKA-1699:
--

FAILURE: Integrated in tika-trunk-jdk1.7 #832 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/832/])
TIKA-1699: fix bundle for GROBID parser deps. (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696319)
* /tika/trunk/tika-bundle/pom.xml
- TIKA-1699: statically load the rest URL properties inside of GROBIDRESTParser 
(mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696286)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java




[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699131#comment-14699131
 ] 

Hudson commented on TIKA-1699:
--

FAILURE: Integrated in tika-trunk-jdk1.7 #830 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/830/])
- fix typo: TIKA-1699 (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696192)
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java
TIKA-1699: refactored GROBID parser to use GROBID rest API. Only introduced 2 
deps, CXF client, and also org.json. very small and works great. Thanks to 
Sujen Shah for his initial work on the GROBID patch. (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696191)
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/TEIParser.java
* 
/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
* 
/tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf




[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-16 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698988#comment-14698988
 ] 

Chris A. Mattmann commented on TIKA-1699:
-

OK, I got the fully REST-based version of the GROBID PDF parser implemented. 
Tests are passing and I'm going to commit it within the next few minutes. 
Basically it only adds the CXF REST client dependency and the org.json 
dependency. A lot better, and a lot smaller. Also, GROBID can now run on another 
machine. Will update the docs shortly.
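
For readers who want a feel for the TEI-handling side of a REST-based integration, here is a minimal Java sketch. It is not the committed TEIParser; it only shows one possible way to pull a title and author surnames out of a TEI header using the JDK's built-in DOM parser. The class name TeiHeaderSketch, its method names, and the hand-written SAMPLE_TEI string are all hypothetical, and real GROBID output should be expected to carry far more structure than this sample.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika or GROBID.
final class TeiHeaderSketch {
    // Namespace used by TEI P5 documents.
    static final String TEI_NS = "http://www.tei-c.org/ns/1.0";

    // Hand-written sample for illustration only; not actual GROBID output.
    static final String SAMPLE_TEI =
        "<TEI xmlns=\"http://www.tei-c.org/ns/1.0\"><teiHeader><fileDesc>"
        + "<titleStmt><title>Example Paper</title></titleStmt>"
        + "<sourceDesc><biblStruct><analytic>"
        + "<author><persName><surname>Doe</surname></persName></author>"
        + "<author><persName><surname>Roe</surname></persName></author>"
        + "</analytic></biblStruct></sourceDesc>"
        + "</fileDesc></teiHeader></TEI>";

    // Parse a TEI string into a namespace-aware DOM document.
    static Document parse(String tei) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted input cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        return factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(tei.getBytes(StandardCharsets.UTF_8)));
    }

    // Return the text of the first <title> element, or null if there is none.
    static String title(Document doc) {
        NodeList titles = doc.getElementsByTagNameNS(TEI_NS, "title");
        return titles.getLength() == 0 ? null : titles.item(0).getTextContent().trim();
    }

    // Collect every <surname> element in document order.
    static List<String> authorSurnames(Document doc) {
        List<String> surnames = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagNameNS(TEI_NS, "surname");
        for (int i = 0; i < nodes.getLength(); i++) {
            surnames.add(nodes.item(i).getTextContent().trim());
        }
        return surnames;
    }
}
```

In a real integration the TEI string would come from the GROBID server's HTTP response, and the extracted values would be copied into the Tika Metadata object.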



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698375#comment-14698375
 ] 

Chris A. Mattmann commented on TIKA-1699:
-

To use this patch, first follow the instructions here: 
https://wiki.apache.org/tika/GrobidJournalParser to install GROBID, and then 
apply this patch.



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698371#comment-14698371
 ] 

Nick Burch commented on TIKA-1699:
--

{quote}Tika-app is ~48MB it seems, so closer to a 30% actual size 
increase.{quote}

I added a bit on for the dependency jars that I can't get to!

{quote}As for depending on a smaller core Jar, I had an idea here. Grobid has a 
server, I wonder if we should just connect to its REST server?{quote}

I know that for some of the dependencies so far, we've worked with them to 
produce a -min version or equivalent, with just the key parts in for size 
reasons. My first choice would be for something like that here. 

If not, could we follow the sqlite patterns, bundle the base java code as 
standard, but require people to download the large bulky native platform code 
to fully enable the support? (Assuming I've got the right idea about the bulk 
being from the CRF native stuff?)



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698340#comment-14698340
 ] 

Chris A. Mattmann commented on TIKA-1699:
-

Filed issues to publish all the grobid-core deps:
Wapiti jar fork:
https://issues.sonatype.org/browse/OSSRH-17124
EUGFC ImageIO plugin:
https://issues.sonatype.org/browse/OSSRH-17126
Language Detection: 
https://issues.sonatype.org/browse/OSSRH-17127
Chasen CRFPP: 
https://issues.sonatype.org/browse/OSSRH-17128
WIPO analysers: 
https://issues.sonatype.org/browse/OSSRH-17129 

That should be all of them. Will let everyone know once it's published.



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698317#comment-14698317
 ] 

Hudson commented on TIKA-1699:
--

FAILURE: Integrated in tika-trunk-jdk1.7 #825 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/825/])
Back out r1695816, so the build can pass again, pending a fix of the broken 
grobid poms. Fix being tracked in TIKA-1699 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696054)
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
* 
/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf




[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698315#comment-14698315
 ] 

Chris A. Mattmann commented on TIKA-1699:
-

bq. I've tried to exclude the grobid transitive dependencies to work around this 
problem, but even an exclude of * still breaks the build on 
org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo 
definition. Unfortunately, I've therefore had to back out your r1695816, in 
order to unbreak the build. Hopefully we can get the grobid community to sort 
that shortly, and we can restore it!

Yeah, we're working with them to get this fixed.

bq. One other possible issue spotted while failing to work around the broken pom 
- the grobid-core jar seems to be almost 15mb in size! Plus its dependencies 
themselves. That means we'll increase the size of the tika-app, tika-server and 
tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could 
depend on instead, which doesn't cause such a bump in our dependency sizes and 
jars?

Looking at: http://repo1.maven.org/maven2/org/apache/tika/tika-app/1.10/

Tika-app is ~48MB it seems, so closer to a 30% actual size increase. As for 
depending on a smaller core Jar, I had an idea here. Grobid has a server, I 
wonder if we should just connect to its REST server? [~sujenshah] In that 
fashion we could omit adding really any dependencies beyond CXF and its 
WebClient. I'll investigate this.




[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698304#comment-14698304
 ] 

Nick Burch commented on TIKA-1699:
--

I've tried to exclude the grobid transitive dependencies to work around this 
problem, but even an exclude of * still breaks the build on 
org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo 
definition. Unfortunately, I've therefore had to back out your r1695816, in 
order to unbreak the build. Hopefully we can get the grobid community to sort 
that shortly, and we can restore it!

One other possible issue spotted while failing to work around the broken pom - 
the grobid-core jar seems to be almost 15mb in size! Plus its dependencies 
themselves. That means we'll increase the size of the tika-app, tika-server and 
tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could 
depend on instead, which doesn't cause such a bump in our dependency sizes and 
jars?



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698161#comment-14698161
 ] 

Nick Burch commented on TIKA-1699:
--

A build from trunk is now failing for me:
{code}
[ERROR] Failed to execute goal on project tika-parsers: Could not resolve 
dependencies for project org.apache.tika:tika-parsers:bundle:1.11-SNAPSHOT: 
Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 -> 
org.chasen:crfpp:jar:1.0.2: Failed to read artifact descriptor for 
org.chasen:crfpp:jar:1.0.2: Could not transfer artifact 
org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo 
(file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and 
cannot be created. -> [Help 1]
{code}

With -X showing
{code}
Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed 
to collect dependencies at org.grobid:grobid-core:jar:0.3.4 -> 
org.chasen:crfpp:jar:1.0.2
Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to 
read artifact descriptor for org.chasen:crfpp:jar:1.0.2
Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not 
transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo 
(file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and 
cannot be created.
{code}

Can we get this broken GROBID dependency pom fixed / an exclusion in place, so 
that trunk builds again?



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-13 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696578#comment-14696578
 ] 

Chris A. Mattmann commented on TIKA-1699:
-

docs are here: https://wiki.apache.org/tika/GrobidJournalParser



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696568#comment-14696568
 ] 

Hudson commented on TIKA-1699:
--

FAILURE: Integrated in tika-trunk-jdk1.7 #821 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/821/])
Changes.txt for TIKA-1699. (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1695817)
* /tika/trunk/CHANGES.txt
- fix for TIKA-1699: Integrate the GROBID PDF extractor in Tika contributed by 
Sujen Shah  this closes #55. (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1695816)
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidConfig.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidHeaderMetadata.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidParser.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
* 
/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
* 
/tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf




[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-13 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696516#comment-14696516
 ] 

Sujen Shah commented on TIKA-1699:
--

Awesome [~chrismattmann] !! Thank you :) Will start work on the wiki. 



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-13 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696517#comment-14696517
 ] 

Sujen Shah commented on TIKA-1699:
--

Awesome [~chrismattmann] !! Thank you :) Will start work on the wiki. 



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696515#comment-14696515
 ] 

ASF GitHub Bot commented on TIKA-1699:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/55




[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-13 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696513#comment-14696513
 ] 

Chris A. Mattmann commented on TIKA-1699:
-

I got this working! :-) 

h2. Starting Tika Server
{noformat}
java -Dorg.apache.tika.service.error.warn=true -classpath \
  $HOME/git/grobidparser-resources/:$HOME/src/tika-server/target/tika-server-1.11-SNAPSHOT.jar:$HOME/grobid/lib/\* \
  org.apache.tika.server.TikaServerCli --config tika-config.xml
{noformat}

h2. cURL command to test
{noformat}
curl -T $HOME/git/grobid/papers/ICSE06.pdf \
  -H "Content-Disposition: attachment;filename=ICSE06.pdf" \
  http://localhost:9998/rmeta | python -mjson.tool
{noformat}
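
For anyone scripting this rather than using cURL, here is a minimal Python sketch of the same request. It assumes Tika Server is listening on localhost:9998 as in the command above. The helper names (build_rmeta_request, fetch_rmeta, parsed_by) are made up for illustration and are not part of Tika.

```python
import json
import urllib.request

# Endpoint taken from the cURL command above; adjust if your server differs.
TIKA_RMETA_URL = "http://localhost:9998/rmeta"


def build_rmeta_request(pdf_bytes, filename, url=TIKA_RMETA_URL):
    """Build the HTTP PUT that `curl -T` would send (hypothetical helper)."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Disposition": "attachment;filename=" + filename},
    )


def fetch_rmeta(pdf_path, filename):
    """Send the PDF to a running Tika Server and return the parsed JSON list."""
    with open(pdf_path, "rb") as handle:
        request = build_rmeta_request(handle.read(), filename)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def parsed_by(records, parser_class):
    """Return True if any metadata record lists parser_class in X-Parsed-By."""
    for record in records:
        parsers = record.get("X-Parsed-By", [])
        if isinstance(parsers, str):
            # A single value may come back as a plain string instead of a list.
            parsers = [parsers]
        if parser_class in parsers:
            return True
    return False
```

Only fetch_rmeta needs the server from the previous step to be running; the other two helpers work offline. Checking parsed_by(records, "org.apache.tika.parser.journal.JournalParser") on the result is a quick way to confirm the journal parser actually handled the file.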

h2. Output

{noformat}
[
    {
        "Author": "End User Computing Services",
        "Company": "ACM",
        "Content-Type": "application/pdf",
        "Creation-Date": "2006-02-15T21:13:58Z",
        "Last-Modified": "2006-02-15T21:16:01Z",
        "Last-Save-Date": "2006-02-15T21:16:01Z",
        "SourceModified": "D:20060215211344",
        "X-Parsed-By": [
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.journal.JournalParser"
        ],
        "X-TIKA:content": 
"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProceedings
 Template - WORD\n\n\nA Software Architecture-Based Framework for Highly 
\nDistributed and Data Intensive Scientific Applications \n\n \nChris A. 
Mattmann1, 2Daniel J. Crichton1Nenad Medvidovic2Steve 
Hughes1 \n\n \n1Jet Propulsion Laboratory \n\nCalifornia Institute of 
Technology \nPasadena, CA 91109, USA 
\n\n{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov \n\n2Computer Science 
Department \nUniversity of Southern California  \n\nLos Angeles, CA 90089, USA 
\n{mattmann,neno}@usc.edu \n\n \nABSTRACT \nModern scientific research is 
increasingly conducted by virtual \ncommunities of scientists distributed 
around the world. The data \nvolumes created by these communities are extremely 
large, and \ngrowing rapidly. The management of the resulting highly 
\ndistributed, virtual data systems is a complex task, characterized \nby a 
number of formidable technical challenges, many of which \nare of a software 
engineering nature.  In this paper we describe \nour experience over the past 
seven years in constructing and \ndeploying OODT, a software framework that 
supports large, \ndistributed, virtual scientific communities. We outline the 
key \nsoftware engineering challenges that we faced, and addressed, \nalong the 
way. We argue that a major contributor to the success of \nOODT was its 
explicit focus on software architecture. We \ndescribe several large-scale, 
real-world deployments of OODT, \nand the manner in which OODT helped us to 
address the domain-\nspecific challenges induced by each deployment.  
\n\nCategories and Subject Descriptors \nD.2 Software Engineering, D.2.11 
Domain Specific Architectures \n\nKeywords \nOODT, Data Management, Software 
Architecture. \n\n1. INTRODUCTION \nSoftware systems of today are very large, 
highly complex, \n\noften widely distributed, increasingly decentralized, 
dynamic, and \nmobile.  There are many causes behind this, spanning virtually 
all \nfacets of human endeavor: desired advances in education, \nentertainment, 
medicine, military technology, \ntelecommunications, transportation, and so on. 
  \n\nOne major driver of software\u2019s growing complexity is \nscientific 
research and exploration.  Today\u2019s scientists are solving \nproblems of 
until recently unimaginable complexity with the help \nof software.  They also 
actively and regularly collaborate with \n\ncolleagues around the world, 
something that has become possible \nonly relatively recently, again ultimately 
thanks to software. They \nare collecting, producing, sharing, and 
disseminating large \namounts of data, which are growing by orders of magnitude 
in \nvolume in remarkably short time periods. \n\nIt is this latter problem 
that NASA\u2019s Jet Propulsion \nLaboratory (JPL) began facing several years 
ago.  Until recently, \nJPL would disseminate data collected by various 
instruments \n(Earth-based, orbiting, and in outer space) to the interested 
\nscientists around the United States by \u201cburning\u201d CD-ROMs and 
\nmailing them via the U.S. Postal Service.  In addition to being \nslow, 
sequential, unidirectional, and lacking interactivity, this \nmethod was 
expensive, costing hundreds of thousands of dollars. \nFurthermore, the method 
was prone to security breaches, and the \nexact data distribution (determining 
which data goes to which \ndestinations) had to be calculated for each 
individual shipment. It \nhad become increasingly difficult to manage this 
process as the \nnumber of projects and missions, as well as involved 
scientists, \ngrew.  An even

[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-03 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652967#comment-14652967
 ] 

Chris A. Mattmann commented on TIKA-1699:
-

Sujen, please update the PR with my 2 comments/updates, and let me know when 
the rest of the JAR files are on Maven Central; then I think we can integrate 
this. We should also make a custom tika-config to override the default PDF 
parser, or better yet somehow combine it with this one. That's one thing I 
wondered too: would it make sense to combine these, or are they really 
separate parsers? It seems like they should be separate, because they 
potentially have overlapping keys, right?
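
To make the overlapping-keys concern concrete, here is a small Python sketch. It is only an illustration: the "Author" and "Content-Type" values are taken from the sample output posted in this thread, the journal-side dict is invented, and plain dicts stand in for Tika's metadata object.

```python
def overlapping_keys(pdf_metadata, journal_metadata):
    """Return the keys both parsers would write, sorted for stable output."""
    return sorted(set(pdf_metadata) & set(journal_metadata))


def merge_keep_all(pdf_metadata, journal_metadata):
    """Merge two metadata dicts, collecting values in lists so nothing is lost."""
    merged = {}
    for source in (pdf_metadata, journal_metadata):
        for key, value in source.items():
            merged.setdefault(key, []).append(value)
    return merged


# Values the PDF parser reported in the sample output.
pdf_metadata = {
    "Author": "End User Computing Services",
    "Content-Type": "application/pdf",
}

# Hypothetical values a journal parser might report (invented for this sketch).
journal_metadata = {
    "Author": "Chris A. Mattmann",
    "Title": "hypothetical title",
}

print(overlapping_keys(pdf_metadata, journal_metadata))
print(merge_keep_all(pdf_metadata, journal_metadata)["Author"])
```

If the two parsers were naively combined and one simply overwrote the other, one of the two "Author" values would disappear. Keeping the parsers separate, or namespacing the journal keys, avoids that.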

We also need to make a page on the Tika wiki that describes how to install 
Grobid: http://wiki.apache.org/tika/GrobidParser maybe?






[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-07-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646321#comment-14646321
 ] 

ASF GitHub Bot commented on TIKA-1699:
--

GitHub user sujen1412 opened a pull request:

https://github.com/apache/tika/pull/55

Fix for TIKA-1699 contributed by Sujen Shah

Waiting for GROBID to get published to maven central. 
Sonatype issue - https://issues.sonatype.org/browse/OSSRH-16837

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sujen1412/tika TIKA-1699

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/55.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #55


commit 4f067107d01e99bd81a66c78163f2a4baf3f817f
Author: Sujen Shah 
Date:   2015-07-29T13:49:00Z

Added grobid dependencies

commit 323ba33816a9beabe22d351c8eac4350fa010be0
Author: Sujen Shah 
Date:   2015-07-29T13:49:36Z

Registering journal parser

commit 71cdd0970fb17aeec85469d07dc1ee6460d2f4da
Author: Sujen Shah 
Date:   2015-07-29T13:54:07Z

Code for integrating GROBID Parser in to Tika

commit b6e9f8724b308e0c830f73702994cbe1c5932cd2
Author: Sujen Shah 
Date:   2015-07-29T13:58:08Z

Grobid properties files

commit 57b70ce38a77cc349588d2f513938bc4f18d4ad4
Author: Sujen Shah 
Date:   2015-07-29T13:58:58Z

Added unit test for journal parser

Corrected formatting

Corrected formatting

Corrected formatting









[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-07-29 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646316#comment-14646316
 ] 

Sujen Shah commented on TIKA-1699:
--

Working towards publishing GROBID to Maven Central through Sonatype. 

Sonatype issue - https://issues.sonatype.org/browse/OSSRH-16837
Grobid issue - https://github.com/kermitt2/grobid/issues/59



