[
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703387#comment-14703387
]
Nick Burch commented on TIKA-1699:
----------------------------------
Quick one - the wiki mentions needing to do a 600 MB git checkout and then a
build. Is it possible to just download a smaller pre-built package of GROBID to
skip this step? And if not, could we maybe suggest it to them for their next
release? (A download in the tens of MB is probably easier and more
beginner-friendly than a huge checkout + having to build!)
> Integrate the GROBID PDF extractor in Tika
> ------------------------------------------
>
> Key: TIKA-1699
> URL: https://issues.apache.org/jira/browse/TIKA-1699
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Sujen Shah
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.11
>
> Attachments: TIKA-1699.grobid-core.MattmannShah.081515.patch.txt,
> TIKA-1699.restgrobid.MattmannWIP081515.patch.txt
>
>
> GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning
> library for extracting, parsing and re-structuring raw documents such as PDF
> into structured TEI-encoded documents with a particular focus on technical
> and scientific publications.
> It has a Java API which can be used to augment PDF parsing for journals and
> help extract extra metadata about the paper like authors, publication,
> citations, etc.
> It would be nice to have this integrated into Tika. I have tried it on my
> local machine and will issue a pull request soon.