[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

Chris A. Mattmann (JIRA) Mon, 03 Aug 2015 19:25:48 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652967#comment-14652967
 ]


Chris A. Mattmann commented on TIKA-1699:
-----------------------------------------

Sujen, please update the PR with my 2 comments/updates, and let me know when 
the rest of the JAR files are on Maven Central; then I think we can integrate 
this. We should also make a custom tika-config to override the default PDF 
parser, or better yet somehow combine it with this one. That's one thing I 
wondered about too: would it make sense to combine these, or are they really 
separate parsers? It seems like they should be separate, because they 
potentially have overlapping keys, right?
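
For illustration, a custom tika-config along those lines might look like the 
sketch below. It keeps the two parsers separate and routes PDFs away from the 
default parser. The class name used for the GROBID-backed parser is an 
assumption (the PR may name it differently); <mime> and <mime-exclude> are 
the usual tika-config elements for per-parser MIME type routing.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default parser keeps handling everything except PDF -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
    <!-- Assumed class name for the GROBID-backed parser; adjust to the PR -->
    <parser class="org.apache.tika.parser.journal.JournalParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
```

With a config like this, the stock PDF parser would no longer see PDFs 
directly, so any combining of the two outputs would have to happen inside 
the GROBID-backed parser itself.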

We also need to make a page on the Tika wiki that describes how to install 
Grobid: http://wiki.apache.org/tika/GrobidParser maybe?

> Integrate the GROBID PDF extractor in Tika
> ------------------------------------------
>
>                 Key: TIKA-1699
>                 URL: https://issues.apache.org/jira/browse/TIKA-1699
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning 
> library for extracting, parsing and re-structuring raw documents such as PDF 
> into structured TEI-encoded documents with a particular focus on technical 
> and scientific publications.
> It has a Java API which can be used to augment PDF parsing for journals and 
> help extract extra metadata about the paper, such as authors, publication, 
> citations, etc. 
> It would be nice to have this integrated into Tika. I have tried it on my 
> local machine and will issue a pull request soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
