[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698371#comment-14698371
 ] 

Nick Burch commented on TIKA-1699:
----------------------------------

{quote}Tika-app is ~48MB, it seems, so it's actually closer to a 30% size 
increase.{quote}

I added a bit on for the dependency jars that I can't get to!

{quote}As for depending on a smaller core Jar, I had an idea here. Grobid has a 
server, I wonder if we should just connect to its REST server?{quote}

I know that for some of the dependencies so far, we've worked with them to 
produce a -min version or equivalent, with just the key parts in for size 
reasons. My first choice would be for something like that here. 

If not, could we follow the sqlite pattern: bundle the base Java code as 
standard, but require people to download the bulky native platform code 
to fully enable the support? (Assuming I'm right that the bulk comes from 
the native CRF code.)
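
As a rough illustration of that second option, here is a minimal sketch of how code could detect at runtime whether an optional, separately downloaded component is on the classpath, and fall back gracefully when it is not. The class name `org.example.grobid.NativeCrfBridge` is purely hypothetical and not taken from GROBID or Tika; the check uses only the standard `Class.forName` call.

```java
public class OptionalSupportCheck {

    /**
     * Returns true if the named class can be located on the classpath.
     * The class is looked up without running its static initialisers.
     */
    public static boolean isAvailable(String className) {
        try {
            Class.forName(className, false,
                    OptionalSupportCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing jar or unresolvable dependency: treat as "not installed".
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical class name, for illustration only.
        String optionalClass = "org.example.grobid.NativeCrfBridge";
        if (isAvailable(optionalClass)) {
            System.out.println("Optional native support found, enabling it");
        } else {
            System.out.println("Optional native support not installed, skipping");
        }
    }
}
```

With a check like this, the base jar stays small and the feature simply stays switched off until the extra download is added to the classpath.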

> Integrate the GROBID PDF extractor in Tika
> ------------------------------------------
>
>                 Key: TIKA-1699
>                 URL: https://issues.apache.org/jira/browse/TIKA-1699
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>         Attachments: TIKA-1699.grobid-core.MattmannShah.081515.patch.txt
>
>
> GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning 
> library for extracting, parsing and re-structuring raw documents such as PDF 
> into structured TEI-encoded documents with a particular focus on technical 
> and scientific publications.
> It has a Java API which can be used to augment PDF parsing for journals and 
> help extract extra metadata about the paper, such as authors, publication, 
> citations, etc. 
> It would be nice to have this integrated into Tika. I have tried it locally 
> and will issue a pull request soon.



