[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

Chris A. Mattmann (JIRA) Sat, 15 Aug 2015 08:27:04 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698315#comment-14698315
 ]


Chris A. Mattmann commented on TIKA-1699:
-----------------------------------------

bq. I've tried to exclude the grobid transitive dependencies to work around this 
problem, but even an exclude of * still breaks the build on 
org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo 
definition. Unfortunately, I've therefore had to back out your r1695816, in 
order to unbreak the build. Hopefully we can get the grobid community to sort 
that shortly, and we can restore it!

Yeah, we're working with them to get this fixed.

bq. One other possible issue spotted while failing to work around the broken pom 
- the grobid-core jar seems to be almost 15mb in size! Plus its dependencies 
themselves. That means we'll increase the size of the tika-app, tika-server and 
tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could 
depend on instead, which doesn't cause such a bump in our dependency sizes and 
jars?

Looking at: http://repo1.maven.org/maven2/org/apache/tika/tika-app/1.10/

tika-app is ~48MB, it seems, so the size increase is actually closer to 30% 
(roughly 15MB added to 48MB). As for depending on a smaller core jar, I had an 
idea here. Grobid has a server; I wonder if we should just connect to its REST 
server? [~sujenshah] That way we could avoid adding really any dependencies 
beyond CXF and its WebClient. I'll investigate this.
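
To make the REST idea concrete, here is a rough sketch of what posting a PDF 
to a running Grobid service could look like. It is only an illustration under 
assumptions: the class name GrobidRestSketch, the form field name "input", and 
whatever endpoint URL gets passed in are all hypothetical and would need 
checking against the Grobid service docs. It also uses the JDK's own 
java.net.http (Java 11+) in place of CXF's WebClient so that it stands alone, 
and it only builds the request; sending it is left to the caller.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class GrobidRestSketch {

    // Builds a single-part multipart/form-data body carrying the PDF bytes.
    // The field name is an assumption; check the Grobid service docs.
    static byte[] multipartBody(String boundary, String fieldName,
                                String fileName, byte[] pdf) {
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";
        byte[] h = head.getBytes(StandardCharsets.US_ASCII);
        byte[] t = tail.getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(h, 0, h.length);
        out.write(pdf, 0, pdf.length);
        out.write(t, 0, t.length);
        return out.toByteArray();
    }

    // Builds (but does not send) a POST request to the given endpoint.
    // The endpoint is supplied by the caller; no URL is hard-coded here.
    static HttpRequest buildRequest(URI endpoint, String fileName, byte[] pdf) {
        String boundary = "tika-grobid-" + UUID.randomUUID();
        byte[] body = multipartBody(boundary, "input", fileName, pdf);
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type",
                        "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }
}
```

The response would presumably be TEI XML that a Tika parser could then map 
onto metadata; that part is not sketched here. The point is that the client 
side needs nothing beyond an HTTP client, which is what keeps grobid-core and 
its dependencies out of the tika-app, tika-server and tika-bundle jars.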


> Integrate the GROBID PDF extractor in Tika
> ------------------------------------------
>
>                 Key: TIKA-1699
>                 URL: https://issues.apache.org/jira/browse/TIKA-1699
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning 
> library for extracting, parsing and re-structuring raw documents such as PDF 
> into structured TEI-encoded documents with a particular focus on technical 
> and scientific publications.
> It has a java api which can be used to augment PDF parsing for journals and 
> help extract extra metadata about the paper like authors, publication, 
> citations, etc. 
> It would be nice to have this integrated into Tika. I have tried it locally 
> and will issue a pull request soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
