[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703408#comment-14703408 ]

Chris A. Mattmann commented on TIKA-1699:

Agreed. We have suggested it in [#59|http://github.com/kermit2/grobid/issues/59]. Please feel free to join the convo there.

> Integrate the GROBID PDF extractor in Tika
>
> Key: TIKA-1699
> URL: https://issues.apache.org/jira/browse/TIKA-1699
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Sujen Shah
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.11
>
> Attachments: TIKA-1699.grobid-core.MattmannShah.081515.patch.txt, TIKA-1699.restgrobid.MattmannWIP081515.patch.txt
>
> GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications.
> It has a java api which can be used to augment PDF parsing for journals and help extract extra metadata about the paper like authors, publication, citations, etc.
> It would be nice to have this integrated into Tika, I have tried it on my local, will issue a pull request soon.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703387#comment-14703387 ]

Nick Burch commented on TIKA-1699:

Quick one - the wiki mentions needing to do a 600 MB git checkout and then a build. Is it possible to just download a smaller pre-built package of GROBID to skip this step? And if not, could we maybe suggest it to them for their next release? (A download of a few tens of MB is probably easier and more beginner-friendly than a huge checkout + having to build!)
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699953#comment-14699953 ]

Hudson commented on TIKA-1699:

FAILURE: Integrated in tika-trunk-jdk1.7 #832 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/832/])

TIKA-1699: fix bundle for GROBID parser deps. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696319)
* /tika/trunk/tika-bundle/pom.xml

- TIKA-1699: statically load the rest URL properties inside of GROBIDRESTParser (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696286)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699131#comment-14699131 ]

Hudson commented on TIKA-1699:

FAILURE: Integrated in tika-trunk-jdk1.7 #830 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/830/])

- fix typo: TIKA-1699 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696192)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java

TIKA-1699: refactored GROBID parser to use GROBID rest API. Only introduced 2 deps, CXF client, and also org.json. very small and works great. Thanks to Sujen Shah for his initial work on the GROBID patch. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696191)
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/TEIParser.java
* /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698988#comment-14698988 ]

Chris A. Mattmann commented on TIKA-1699:

OK, I got the fully REST-services version of the GROBID PDF parser implemented. Tests are passing and I'm going to commit it within the next few minutes. Basically it only adds the CXF REST client dependency and the org.json dependency. A lot better, and a lot smaller. Also, GROBID can now live on another machine. Will update the docs shortly.
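For readers following along: with the REST approach described above, the client side receives TEI XML from a GROBID server and has to turn it into metadata fields. Below is a minimal sketch of that last step in Python, using only the standard library. The TEI sample is hand-written for illustration (it is not actual GROBID output), and the helper name `extract_header_metadata` is hypothetical. Tika's own implementation is in Java and may work quite differently.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample for illustration only; NOT real GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A Sample Paper Title</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author><persName><forename>Ada</forename><surname>Lovelace</surname></persName></author>
            <author><persName><forename>Alan</forename><surname>Turing</surname></persName></author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def extract_header_metadata(tei_xml):
    """Hypothetical helper: pull the title and author names out of a TEI header."""
    root = ET.fromstring(tei_xml)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    authors = []
    for pers in root.findall(".//tei:author/tei:persName", TEI_NS):
        # Join whichever name parts are present, skipping missing ones.
        parts = [
            pers.findtext("tei:forename", default="", namespaces=TEI_NS),
            pers.findtext("tei:surname", default="", namespaces=TEI_NS),
        ]
        authors.append(" ".join(p for p in parts if p))
    return {
        "title": title_el.text if title_el is not None else None,
        "authors": authors,
    }


if __name__ == "__main__":
    print(extract_header_metadata(SAMPLE_TEI))
```

Keeping the TEI handling separate from the HTTP call is what lets GROBID run on another machine: the client only needs an XML parser, not the GROBID jars.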
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698375#comment-14698375 ]

Chris A. Mattmann commented on TIKA-1699:

To use this patch, first follow the instructions here to install GROBID: https://wiki.apache.org/tika/GrobidJournalParser and then apply the patch.
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698371#comment-14698371 ]

Nick Burch commented on TIKA-1699:

{quote}Tika-app is ~48MB it seems so closer to 30% actually size increase.{quote}

I added a bit on for the dependency jars that I can't get to!

{quote}As for depending on a smaller core Jar, I had an idea here. Grobid has a server, I wonder if we should just connect to its REST server?{quote}

I know that for some of the dependencies so far, we've worked with them to produce a -min version or equivalent, containing just the key parts, for size reasons. My first choice would be something like that here. If not, could we follow the sqlite pattern: bundle the base Java code as standard, but require people to download the bulky native platform code to fully enable the support? (Assuming I've got the right idea about the bulk coming from the CRF native stuff?)
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698340#comment-14698340 ]

Chris A. Mattmann commented on TIKA-1699:

All, I've filed issues to publish all of the grobid-core deps:

* Wapiti jar fork: https://issues.sonatype.org/browse/OSSRH-17124
* EUGFC ImageIO plugin: https://issues.sonatype.org/browse/OSSRH-17126
* Language Detection: https://issues.sonatype.org/browse/OSSRH-17127
* Chasen CRFPP: https://issues.sonatype.org/browse/OSSRH-17128
* WIPO analysers: https://issues.sonatype.org/browse/OSSRH-17129

That should be all of them. Will let everyone know once they're published.
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698317#comment-14698317 ]

Hudson commented on TIKA-1699:

FAILURE: Integrated in tika-trunk-jdk1.7 #825 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/825/])

Back out r1695816, so the build can pass again, pending a fix of the broken grobid poms. Fix being tracked in TIKA-1699 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696054)
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698315#comment-14698315 ]

Chris A. Mattmann commented on TIKA-1699:

bq. I've tried to exclude the grobid transitive dependencies to work around this problem, but even an exclude of * still breaks the build on org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo definition. Unfortunately, I've therefore had to back out your r1695816, in order to unbreak the build. Hopefully we can get the grobid community to sort that shortly, and we can restore it!

Yeah, we're working with them on getting this fixed.

bq. One other possible issue spotted while failing to work around the broken pom - the grobid-core jar seems to be almost 15 MB in size! Plus its dependencies themselves. That means we'll increase the size of the tika-app, tika-server and tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could depend on instead, which doesn't cause such a bump in our dependency sizes and jars?

Looking at: http://repo1.maven.org/maven2/org/apache/tika/tika-app/1.10/
Tika-app is ~48 MB, it seems, so it's actually closer to a 30% size increase. As for depending on a smaller core jar, I had an idea here: GROBID has a server, so I wonder if we should just connect to its REST server? [~sujenshah] That way we could avoid adding really any dependencies beyond CXF and its WebClient. I'll investigate this.
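The size estimate above is simple arithmetic. A quick check in Python, assuming the round figures quoted in this thread (grobid-core at roughly 15 MB, tika-app 1.10 at roughly 48 MB) and ignoring GROBID's own dependencies, whose sizes are not given here:

```python
def percent_increase(added_mb, base_mb):
    """Size increase from adding `added_mb`, as a percentage of `base_mb`."""
    return 100.0 * added_mb / base_mb


GROBID_CORE_MB = 15.0  # "almost 15mb", per the thread (assumed round figure)
TIKA_APP_MB = 48.0     # "~48MB", per the thread (assumed round figure)

if __name__ == "__main__":
    increase = percent_increase(GROBID_CORE_MB, TIKA_APP_MB)
    print(f"grobid-core alone adds about {increase:.0f}% to tika-app")
```

So "closer to 30%" holds for the core jar on its own; whatever GROBID's dependencies add would push the real figure higher.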
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698304#comment-14698304 ]

Nick Burch commented on TIKA-1699:

I've tried to exclude the grobid transitive dependencies to work around this problem, but even an exclude of * still breaks the build on org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo definition. Unfortunately, I've therefore had to back out your r1695816, in order to unbreak the build. Hopefully we can get the grobid community to sort that shortly, and we can restore it!

One other possible issue spotted while failing to work around the broken pom - the grobid-core jar seems to be almost 15 MB in size! Plus its dependencies themselves. That means we'll increase the size of the tika-app, tika-server and tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could depend on instead, which doesn't cause such a bump in our dependency sizes and jars?
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698161#comment-14698161 ]

Nick Burch commented on TIKA-1699:

A build from trunk is now failing for me:

{code}
[ERROR] Failed to execute goal on project tika-parsers: Could not resolve dependencies for project org.apache.tika:tika-parsers:bundle:1.11-SNAPSHOT: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 -> org.chasen:crfpp:jar:1.0.2: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created. -> [Help 1]
{code}

With -X showing

{code}
Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 -> org.chasen:crfpp:jar:1.0.2
Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2
Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created.
{code}

Can we get this broken GROBID dependency pom fixed, or an exclusion put in place, so that trunk builds again?
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696578#comment-14696578 ]

Chris A. Mattmann commented on TIKA-1699:

docs are here: https://wiki.apache.org/tika/GrobidJournalParser
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696568#comment-14696568 ]

Hudson commented on TIKA-1699:

FAILURE: Integrated in tika-trunk-jdk1.7 #821 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/821/])

Changes.txt for TIKA-1699. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1695817)
* /tika/trunk/CHANGES.txt

- fix for TIKA-1699: Integrate the GROBID PDF extractor in Tika contributed by Sujen Shah this closes #55. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1695816)
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidConfig.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidHeaderMetadata.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
* /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696516#comment-14696516 ]

Sujen Shah commented on TIKA-1699:

Awesome [~chrismattmann] !! Thank you :) Will start work on the wiki.
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696515#comment-14696515 ]

ASF GitHub Bot commented on TIKA-1699:

Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/55
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696513#comment-14696513 ] Chris A. Mattmann commented on TIKA-1699: - I got this working! :-) h2. Starting Tika Server {noformat} java -Dorg.apache.tika.service.error.warn=true -classpath $HOME/git/grobidparser-resources/:$HOME/src/tika-server/target/tika-server-1.11-SNAPSHOT.jar:$HOME/grobid/lib/\* org.apache.tika.server.TikaServerCli --config tika-config.xml {noformat} h2. cURL command to test {noformat} curl -T $HOME/git/grobid/papers/ICSE06.pdf -H "Content-Disposition: attachment;filename=ICSE06.pdf" http://localhost:9998/rmeta | python -mjson.tool {noformat} h2. Output {noformat} [ { "Author": "End User Computing Services", "Company": "ACM", "Content-Type": "application/pdf", "Creation-Date": "2006-02-15T21:13:58Z", "Last-Modified": "2006-02-15T21:16:01Z", "Last-Save-Date": "2006-02-15T21:16:01Z", "SourceModified": "D:20060215211344", "X-Parsed-By": [ "org.apache.tika.parser.CompositeParser", "org.apache.tika.parser.journal.JournalParser" ], "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProceedings Template - WORD\n\n\nA Software Architecture-Based Framework for Highly \nDistributed and Data Intensive Scientific Applications \n\n \nChris A. Mattmann1, 2Daniel J. Crichton1Nenad Medvidovic2Steve Hughes1 \n\n \n1Jet Propulsion Laboratory \n\nCalifornia Institute of Technology \nPasadena, CA 91109, USA \n\n{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov \n\n2Computer Science Department \nUniversity of Southern California \n\nLos Angeles, CA 90089, USA \n{mattmann,neno}@usc.edu \n\n \nABSTRACT \nModern scientific research is increasingly conducted by virtual \ncommunities of scientists distributed around the world. The data \nvolumes created by these communities are extremely large, and \ngrowing rapidly. 
The management of the resulting highly \ndistributed, virtual data systems is a complex task, characterized \nby a number of formidable technical challenges, many of which \nare of a software engineering nature. In this paper we describe \nour experience over the past seven years in constructing and \ndeploying OODT, a software framework that supports large, \ndistributed, virtual scientific communities. We outline the key \nsoftware engineering challenges that we faced, and addressed, \nalong the way. We argue that a major contributor to the success of \nOODT was its explicit focus on software architecture. We \ndescribe several large-scale, real-world deployments of OODT, \nand the manner in which OODT helped us to address the domain-\nspecific challenges induced by each deployment. \n\nCategories and Subject Descriptors \nD.2 Software Engineering, D.2.11 Domain Specific Architectures \n\nKeywords \nOODT, Data Management, Software Architecture. \n\n1. INTRODUCTION \nSoftware systems of today are very large, highly complex, \n\noften widely distributed, increasingly decentralized, dynamic, and \nmobile. There are many causes behind this, spanning virtually all \nfacets of human endeavor: desired advances in education, \nentertainment, medicine, military technology, \ntelecommunications, transportation, and so on. \n\nOne major driver of software\u2019s growing complexity is \nscientific research and exploration. Today\u2019s scientists are solving \nproblems of until recently unimaginable complexity with the help \nof software. They also actively and regularly collaborate with \n\ncolleagues around the world, something that has become possible \nonly relatively recently, again ultimately thanks to software. They \nare collecting, producing, sharing, and disseminating large \namounts of data, which are growing by orders of magnitude in \nvolume in remarkably short time periods. 
\n\nIt is this latter problem that NASA\u2019s Jet Propulsion \nLaboratory (JPL) began facing several years ago. Until recently, \nJPL would disseminate data collected by various instruments \n(Earth-based, orbiting, and in outer space) to the interested \nscientists around the United States by \u201cburning\u201d CD-ROMs and \nmailing them via the U.S. Postal Service. In addition to being \nslow, sequential, unidirectional, and lacking interactivity, this \nmethod was expensive, costing hundreds of thousands of dollars. \nFurthermore, the method was prone to security breaches, and the \nexact data distribution (determining which data goes to which \ndestinations) had to be calculated for each individual shipment. It \nhad become increasingly difficult to manage this process as the \nnumber of projects and missions, as well as involved scientists, \ngrew. An even
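The cURL test in the comment above can also be scripted. What follows is only a minimal Python sketch, assuming a Tika Server is already listening on http://localhost:9998 as in the command shown; the helper names `build_rmeta_request`, `fetch_metadata` and `parsed_by` are hypothetical, and the explicit `Content-Type: application/pdf` header is an assumption added here (cURL does not send one) to keep `urllib` from applying its form-encoded default.

```python
import json
import urllib.request
from pathlib import Path


def build_rmeta_request(filename, data, base_url="http://localhost:9998"):
    """Build a PUT request to /rmeta, mirroring `curl -T <file> .../rmeta`."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/rmeta",
        data=data,
        method="PUT",
        headers={
            # Same header as the cURL command in the comment.
            "Content-Disposition": "attachment;filename=" + filename,
            # Assumption: set explicitly so urllib does not add
            # application/x-www-form-urlencoded on its own.
            "Content-Type": "application/pdf",
        },
    )


def fetch_metadata(pdf_path, base_url="http://localhost:9998"):
    """Send the PDF to a running Tika Server and decode the JSON reply."""
    path = Path(pdf_path)
    request = build_rmeta_request(path.name, path.read_bytes(), base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def parsed_by(records):
    """Collect every X-Parsed-By value from a list of metadata records."""
    parsers = []
    for record in records:
        value = record.get("X-Parsed-By", [])
        if isinstance(value, str):  # tolerate a single string value
            value = [value]
        parsers.extend(value)
    return parsers
```

With the server from the comment running, `parsed_by(fetch_metadata("ICSE06.pdf"))` would be one way to check whether `org.apache.tika.parser.journal.JournalParser` handled the file, as it did in the output above.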
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652967#comment-14652967 ]

Chris A. Mattmann commented on TIKA-1699:

Sujen, please update the PR with my 2 comments/updates, and please let me know when the rest of the JAR files are on Maven Central; then I think we can integrate this.

We should also make a custom tika-config to override the default PDF parser or, better yet, somehow combine the two. That's one thing I wondered about too: would it make sense to combine these, or are they really separate parsers? It seems like they should be separate, because they potentially have overlapping keys, right?

We also need to make a page on the Tika wiki that describes how to install Grobid: http://wiki.apache.org/tika/GrobidParser maybe?
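A custom tika-config of the kind discussed in this comment might route PDFs to the journal parser. The sketch below writes such a file with Python's standard library. It assumes Tika's `<properties>/<parsers>/<parser class="..."><mime>` configuration layout, and mapping `application/pdf` to `org.apache.tika.parser.journal.JournalParser` (the class named in the X-Parsed-By output earlier in this thread) is an illustrative assumption, not the configuration the project settled on. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Class name taken from the X-Parsed-By output in this thread.
JOURNAL_PARSER = "org.apache.tika.parser.journal.JournalParser"


def build_tika_config(parser_class=JOURNAL_PARSER,
                      mime_types=("application/pdf",)):
    """Return tika-config XML that assigns the given MIME types to one parser."""
    root = ET.Element("properties")
    parsers = ET.SubElement(root, "parsers")
    parser = ET.SubElement(parsers, "parser", {"class": parser_class})
    for mime_type in mime_types:
        ET.SubElement(parser, "mime").text = mime_type
    return ET.tostring(root, encoding="unicode")


def write_tika_config(path="tika-config.xml", **kwargs):
    """Write the generated configuration to disk with an XML declaration."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        handle.write(build_tika_config(**kwargs))
        handle.write("\n")
```

A file produced this way could then be passed to Tika Server through the `--config` option. Whether this replaces the default PDF parser or runs alongside it is exactly the open question raised in the comment.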
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646321#comment-14646321 ]

ASF GitHub Bot commented on TIKA-1699:

GitHub user sujen1412 opened a pull request:

    https://github.com/apache/tika/pull/55

    Fix for TIKA-1699 contributed by Sujen Shah

    Waiting for GROBID to get published to maven central.
    Sonatype issue - https://issues.sonatype.org/browse/OSSRH-16837

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sujen1412/tika TIKA-1699

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/55.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #55

commit 4f067107d01e99bd81a66c78163f2a4baf3f817f
Author: Sujen Shah
Date: 2015-07-29T13:49:00Z

    Added grobid dependencies

commit 323ba33816a9beabe22d351c8eac4350fa010be0
Author: Sujen Shah
Date: 2015-07-29T13:49:36Z

    Registering journal parser

commit 71cdd0970fb17aeec85469d07dc1ee6460d2f4da
Author: Sujen Shah
Date: 2015-07-29T13:54:07Z

    Code for integrating GROBID Parser in to Tika

commit b6e9f8724b308e0c830f73702994cbe1c5932cd2
Author: Sujen Shah
Date: 2015-07-29T13:58:08Z

    Grobid properties files

commit 57b70ce38a77cc349588d2f513938bc4f18d4ad4
Author: Sujen Shah
Date: 2015-07-29T13:58:58Z

    Added unit test for journal parser

    Corrected formatting

    Corrected formatting

    Corrected formatting
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646316#comment-14646316 ]

Sujen Shah commented on TIKA-1699:

Working towards publishing GROBID to Maven Central through Sonatype.

Sonatype issue - https://issues.sonatype.org/browse/OSSRH-16837
Grobid issue - https://github.com/kermitt2/grobid/issues/59