[jira] [Issue Comment Deleted] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1706: -- Comment: was deleted (was: A patch to bring back commons-io to tika-core and replace all formerly inlined classes.) Bring back commons-io to tika-core -- Key: TIKA-1706 URL: https://issues.apache.org/jira/browse/TIKA-1706 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space. I believe these arguments are weaker nowadays due to the following concerns: - Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it - Since some modules use both tika-core and commons-io, it's not clear which code should be used - Having the inlined classes causes more maintenance and/or technology debt (which in turn causes more maintenance) - Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on. I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1699: Attachment: TIKA-1699.restgrobid.MattmannWIP081515.patch.txt - here's a WIP patch to convert the Grobid parser to use its REST services. Tests are passing. I need to add the rest of the GROBID header XML metadata elements. Just got a bit tired :) [~sujenshah] if you want to finish this off, all you. Else if you don't beat me to it, maybe I'll finish it tomorrow. Integrate the GROBID PDF extractor in Tika -- Key: TIKA-1699 URL: https://issues.apache.org/jira/browse/TIKA-1699 Project: Tika Issue Type: New Feature Components: parser Reporter: Sujen Shah Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 Attachments: TIKA-1699.grobid-core.MattmannShah.081515.patch.txt, TIKA-1699.restgrobid.MattmannWIP081515.patch.txt GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications. It has a java api which can be used to augment PDF parsing for journals and help extract extra metadata about the paper like authors, publication, citations, etc. It would be nice to have this integrated into Tika, I have tried it on my local, will issue a pull request soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698477#comment-14698477 ] Yaniv Kunda commented on TIKA-1706: --- I've separated out all the related changes other than adding commons-io to tika-core, and opened them under TIKA-1710. In addition, the recently added commons-io-unsafe check has now found a couple more default encoding usages: tika-core: src\main\java\org\apache\tika\Tika.java tika-server: src\test\java\org\apache\tika\server\CXFTestBase.java Bring back commons-io to tika-core -- Key: TIKA-1706 URL: https://issues.apache.org/jira/browse/TIKA-1706 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space. I believe these arguments are weaker nowadays due to the following concerns: - Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it - Since some modules use both tika-core and commons-io, it's not clear which code should be used - Having the inlined classes causes more maintenance and/or technology debt (which in turn causes more maintenance) - Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on. I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives
Yaniv Kunda created TIKA-1710: - Summary: Replace usages of classes in org.apache.tika.io with current alternatives Key: TIKA-1710 URL: https://issues.apache.org/jira/browse/TIKA-1710 Project: Tika Issue Type: Improvement Components: batch, cli, core, example, gui, parser, server, translation Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Many of the classes in org.apache.tika.io were inlined from commons-io in TIKA-249, but these days most components use commons-io anyway, so in order to clean the dependencies on org.apache.tika.io in preparation of adding commons-io to tika-core, the following can be done: - Replace usages of classes in org.apache.tika.io within non-core components with the corresponding classes in commons-io - Replace usages of org.apache.tika.io.IOUtils.UTF_8 with java.nio.charset.StandardCharsets.UTF_8 (in all components, including tika-core) - Replace other uses of String encoding names of standard charsets with their corresponding Charsets instances from StandardCharsets (this is logically related to IOUtils as these constants should have been there as UTF_8 was before Java 7) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
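For readers unfamiliar with the replacement TIKA-1710 proposes, here is a minimal sketch of the difference between naming a charset by string and using the java.nio.charset.StandardCharsets constant. The class and method names (CharsetMigrationSketch, encodeByName, encodeByConstant) are made up for illustration and are not taken from the Tika code base or from the patch.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not code from Tika or the TIKA-1710 patch.
class CharsetMigrationSketch {

    // Old style: the charset is looked up by name, so the compiler forces
    // the caller to handle a checked exception that cannot occur for UTF-8.
    static byte[] encodeByName(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is required by the Java platform", e);
        }
    }

    // New style (Java 7+): the constant is a Charset instance, so there is
    // no name lookup and no checked exception.
    static byte[] encodeByConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9";
        // Both styles produce identical bytes; only the call site changes.
        System.out.println(Arrays.equals(encodeByName(sample), encodeByConstant(sample)));
    }
}
```

The same pattern applies to the other charsets that StandardCharsets defines (US_ASCII, ISO_8859_1, UTF_16 and its BE/LE variants), which is presumably what the third bullet of the issue refers to.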
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698161#comment-14698161 ] Nick Burch commented on TIKA-1699: -- A build from trunk is now failing for me: {code} [ERROR] Failed to execute goal on project tika-parsers: Could not resolve dependencies for project org.apache.tika:tika-parsers:bundle:1.11-SNAPSHOT: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 - org.chasen:crfpp:jar:1.0.2: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created. - [Help 1] {code} With -X showing {code} Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 - org.chasen:crfpp:jar:1.0.2 Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2 Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created. {code} Can we get this broken GROBID dependency pom fixed / an exclusion in place, so that trunk builds again? Integrate the GROBID PDF extractor in Tika -- Key: TIKA-1699 URL: https://issues.apache.org/jira/browse/TIKA-1699 Project: Tika Issue Type: New Feature Components: parser Reporter: Sujen Shah Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications.
It has a java api which can be used to augment PDF parsing for journals and help extract extra metadata about the paper like authors, publication, citations, etc. It would be nice to have this integrated into Tika, I have tried it on my local, will issue a pull request soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch reopened TIKA-1699: -- Integrate the GROBID PDF extractor in Tika -- Key: TIKA-1699 URL: https://issues.apache.org/jira/browse/TIKA-1699 Project: Tika Issue Type: New Feature Components: parser Reporter: Sujen Shah Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications. It has a java api which can be used to augment PDF parsing for journals and help extract extra metadata about the paper like authors, publication, citations, etc. It would be nice to have this integrated into Tika, I have tried it on my local, will issue a pull request soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
tika-trunk-jdk1.7 - Build # 824 - Still Failing
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #824) Status: Still Failing Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/824/ to view the results.
[jira] [Comment Edited] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698161#comment-14698161 ] Chris A. Mattmann edited comment on TIKA-1699 at 8/15/15 5:46 PM: -- A build from trunk is now failing for me: {code} [ERROR] Failed to execute goal on project tika-parsers: Could not resolve dependencies for project org.apache.tika:tika-parsers:bundle:1.11-SNAPSHOT: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 - org.chasen:crfpp:jar:1.0.2: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created. - [Help 1] {code} With -X showing {code} Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 - org.chasen:crfpp:jar:1.0.2 Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2 Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created. {code} Can we get this broken GROBID dependency pom fixed / an exclusion in place, so that trunk builds again? 
was (Author: gagravarr): A build from trunk is now failing for me: {code} [ERROR] Failed to execute goal on project tika-parsers: Could not resolve dependencies for project org.apache.tika:tika-parsers:bundle:1.11-SNAPSHOT: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 - org.chasen:crfpp:jar:1.0.2: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created. - [Help 1] {code} With -X showing {code} Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 - org.chasen:crfpp:jar:1.0.2 Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2 Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created. {code} Can we get this broken GROBIN dependency pom fixed / an exclusion in place, so that trunk builds again? Integrate the GROBID PDF extractor in Tika -- Key: TIKA-1699 URL: https://issues.apache.org/jira/browse/TIKA-1699 Project: Tika Issue Type: New Feature Components: parser Reporter: Sujen Shah Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 Attachments: TIKA-1699.grobid-core.MattmannShah.081515.patch.txt GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications. 
It has a java api which can be used to augment PDF parsing for journals and help extract extra metadata about the paper like authors, publication, citations, etc. It would be nice to have this integrated into Tika, I have tried it on my local, will issue a pull request soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698371#comment-14698371 ] Nick Burch commented on TIKA-1699: -- {quote}Tika-app is ~48MB it seems so closer to 30% actually size increase.{quote} I added a bit on for the dependency jars that I can't get to! {quote}As for depending on a smaller core Jar, I had an idea here. Grobid has a server, I wonder if we should just connect to its REST server?{quote} I know that for some of the dependencies so far, we've worked with them to produce a -min version or equivalent, with just the key parts in for size reasons. My first choice would be for something like that here. If not, could we follow the sqlite patterns, bundle the base java code as standard, but require people to download the large bulky native platform code to fully enable the support? (Assuming I've got the right idea about the bulk being from the CRF native stuff?) Integrate the GROBID PDF extractor in Tika -- Key: TIKA-1699 URL: https://issues.apache.org/jira/browse/TIKA-1699 Project: Tika Issue Type: New Feature Components: parser Reporter: Sujen Shah Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 Attachments: TIKA-1699.grobid-core.MattmannShah.081515.patch.txt GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications. It has a java api which can be used to augment PDF parsing for journals and help extract extra metadata about the paper like authors, publication, citations, etc. It would be nice to have this integrated into Tika, I have tried it on my local, will issue a pull request soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698375#comment-14698375 ] Chris A. Mattmann commented on TIKA-1699: - To use this patch, follow the instructions first here: https://wiki.apache.org/tika/GrobidJournalParser to install Grobid, and then apply this patch. Integrate the GROBID PDF extractor in Tika -- Key: TIKA-1699 URL: https://issues.apache.org/jira/browse/TIKA-1699 Project: Tika Issue Type: New Feature Components: parser Reporter: Sujen Shah Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 Attachments: TIKA-1699.grobid-core.MattmannShah.081515.patch.txt GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications. It has a java api which can be used to augment PDF parsing for journals and help extract extra metadata about the paper like authors, publication, citations, etc. It would be nice to have this integrated into Tika, I have tried it on my local, will issue a pull request soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698383#comment-14698383 ] Nick Burch commented on TIKA-1707: -- The build is hopefully working again now. If you could re-test, that'd be wonderful! Upgrade to Apache POI 3.13 Beta 2 - Key: TIKA-1707 URL: https://issues.apache.org/jira/browse/TIKA-1707 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.9 Reporter: Andreas Beeker Attachments: common_sl.diff In the not so far future, POI 3.13 Beta 2 will be available. This contains a quite big change to the Powerpoint modules XSLF/HSLF, but thankfully TIKA isn't much affected. Please try the patch on our trunk and post side-effects. As the work on the common_sl api hasn't been finished yet, there might be another patch for the next POI beta version. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698401#comment-14698401 ] Hudson commented on TIKA-1706: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #826 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/826/]) Use a consistent version of Commons IO everywhere, enable the Forbidden APIs check for it, and fix problems it found TIKA-1706 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=revrev=1696079) * /tika/trunk/tika-app/pom.xml * /tika/trunk/tika-batch/pom.xml * /tika/trunk/tika-example/pom.xml * /tika/trunk/tika-example/src/main/java/org/apache/tika/example/DirListParser.java * /tika/trunk/tika-example/src/main/java/org/apache/tika/example/MyFirstTika.java * /tika/trunk/tika-example/src/main/java/org/apache/tika/example/RollbackSoftware.java * /tika/trunk/tika-example/src/test/java/org/apache/tika/example/SimpleTextExtractorTest.java * /tika/trunk/tika-parent/pom.xml * /tika/trunk/tika-parsers/pom.xml * /tika/trunk/tika-server/pom.xml * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TranslateResource.java Bring back commons-io to tika-core -- Key: TIKA-1706 URL: https://issues.apache.org/jira/browse/TIKA-1706 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Attachments: TIKA-1706.patch TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space. I believe these arguments are weaker nowadays due to the following concerns: - Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it - Since some modules use both tika-core and commons-io, it's not clear which code should be used - Having the inlined classes causes more maintenance and/or technology debt (which in turn causes more maintenance) - Newer commons-io code utilizes newer platform code, e.g. 
using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on. I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698406#comment-14698406 ] Andreas Beeker commented on TIKA-1707: -- The affected test cases are ok now ... I haven't tried the full fledged tika test suite, as my JRE chokes on the 2GB heap settings, but tika-parsers seems to be ok with 1GB Upgrade to Apache POI 3.13 Beta 2 - Key: TIKA-1707 URL: https://issues.apache.org/jira/browse/TIKA-1707 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.9 Reporter: Andreas Beeker Attachments: common_sl.diff In the not so far future, POI 3.13 Beta 2 will be available. This contains a quite big change to the Powerpoint modules XSLF/HSLF, but thankfully TIKA isn't much affected. Please try the patch on our trunk and post side-effects. As the work on the common_sl api hasn't been finished yet, there might be another patch for the next POI beta version. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698304#comment-14698304 ] Nick Burch commented on TIKA-1699: -- I've tried to exclude the grobid transitive dependencies to work around this problem, but even an exclude of * still breaks the build on org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo definition. Unfortunately, I've therefore had to back out your r1695816, in order to unbreak the build. Hopefully we can get the grobid community to sort that out shortly, and we can restore it! One other possible issue spotted while failing to work around the broken pom - the grobid-core jar seems to be almost 15MB in size! Plus its dependencies themselves. That means we'll increase the size of the tika-app, tika-server and tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could depend on instead, which doesn't cause such a bump in our dependency sizes and jars? Integrate the GROBID PDF extractor in Tika -- Key: TIKA-1699 URL: https://issues.apache.org/jira/browse/TIKA-1699 Project: Tika Issue Type: New Feature Components: parser Reporter: Sujen Shah Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications. It has a java api which can be used to augment PDF parsing for journals and help extract extra metadata about the paper like authors, publication, citations, etc. It would be nice to have this integrated into Tika, I have tried it on my local, will issue a pull request soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [DISCUSS] A more modular parser project
Hi, So just to understand the break downs. When you say: tika-classic-parser-bundle/ Tika-office-parser-bundle/ (including microsoft, opendocument, pst, rtf, iwork? Has dependency on html/text) Tika-pdf-parser-bundle/ Tika-text-parser-bundle (including txt,chm, rfc822, html, xml, kml, feed, iptc, crypto, etc?)/ Tika-sourcecode-parser-bundle (parsers that handle source code) Tika-package-parser-bundle (all zip/tar/etc) Does that indicate 6 bundles? 5 individuals that could wrap into 1 uber jar? Breaking things down at different levels will add to maintenance effort so it may be better to start with the broad strokes like tika-classic-parser-bundle. But if we just created a tika-classic-parser-bundle are we attempting to group the bundles by a type of usecase? I think this approach is fine but it does mean we're taking an opinion on what most of Tika's basic users want for simple usecases. Another approach could be grouping the parsers by similar dependencies which I think the tika-multimedia-parser-bundle does fairly well. From a dependence management perspective this is desirable. I've used tools like JDepend to break down which packages use which dependencies. Also determining package based dependencies within tika-parsers can be seen here in sonar: http://nemo.sonarqube.org/design/index/253571 With respect to bundles that don't fit perhaps those live on their own until an obvious emerges. It's much harder to remove something from a bundle than to add it later. I think this may apply to native bundles too. - Bob On 8/4/2015 8:32 AM, Allison, Timothy B. wrote: Bob, Thank you, again. This looks promising at first glance! To continue down the strawman path and to start discussion on the elephant in the room... We'd want bundles that allow enough control for users but aren't too much of a hassle to configure. There will be trade-offs. 
So, what do we think of this strawman for proposed bundles: tika-classic-parser-bundle/ Tika-office-parser-bundle/ (including microsoft, opendocument, pst, rtf, iwork? Has dependency on html/text) Tika-pdf-parser-bundle/ Tika-text-parser-bundle (including txt,chm, rfc822, html, xml, kml, feed, iptc, crypto, etc?)/ Tika-sourcecode-parser-bundle (parsers that handle source code) Tika-package-parser-bundle (all zip/tar/etc) tika-multimedia-parser-bundle/ (parsers that pull metadata out of image, audio, audio+video files) Tika-image-parser-bundle Tika-image-ocr-parser-bundle Tika-audio-parser-bundle Tika-video-parser-bundle tika-scientific-parser-bundle/ (all parsers that handle scientific data sets (grib, isatab,gdal,hdf,netcdf,geoinfo,dif...much hand-waving...input, Chris?) tika-nativelib-parser-bundle/ (sqlite...any others at the moment? all parsers that rely on native libs...unfortunately, this doesn't fit well thematically...) tika-advanced-bundle/ (all parsers that rely on nlp or other advanced techniques for extraction of information... these aren't really just pulling text and metadata out, but are operating on the text/metadata once it has been pulled out. We may need separate bundles for each?) Tika-nlp-parser-bundle/ (ctakes, phone number, geo.topic, grobid(?) etc. ...or maybe we want separate bundles for each?) Tika-sentiment-parser-bundle (imaginary...?) Tika-object-parser-bundle Where to put? font parser executable mat prt strings Cheers, Tim -Original Message- From: Bob Paulin [mailto:b...@bobpaulin.com] Sent: Tuesday, August 04, 2015 8:56 AM To: dev@tika.apache.org Subject: Re: [DISCUSS] A more modular parser project So I just tried adding a META-INF/services/org.apache.tika.parser.Parser file to each bundle in the straw man implementation and it seemed to do the trick. 
Looks like the ServiceLoader code searches the classloader for all of these files and iterates through them to pick up each jar's META-INF/services/org.apache.tika.parser.Parser entries and adds them to the list. I've updated the code on github to include one per bundle. This might be the way to go. ex. https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles/tika-image-parser-bundle/src/main/resources/META-INF/services - Bob On 8/3/2015 9:21 PM, Allison, Timothy B. wrote: +1 to moving the source to bundles. I think for a 2.0 would be easier to consolidate into a parser uber jar than trying to tease things out like I did in the straw man impl. However deciding how to break things up might take some experimentation. Y, and the strawman is a great easy entry down this path towards 2.0. I think the main hangup will be coming to consensus about granularity and nature of the packages, but we can burn that bridge when we get to it. There are some
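A rough model of the lookup described in this thread, where each bundle jar ships its own META-INF/services/org.apache.tika.parser.Parser file and the entries from all of them end up in one list. This is only a sketch of the provider-configuration file format (one class name per line, '#' starts a comment), not the actual Tika or JDK ServiceLoader code. The class name ServiceFileMergeSketch, its methods, and the org.example parser names are all hypothetical, and file contents are passed in as strings rather than read from the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of merging per-jar service files; not Tika's implementation.
class ServiceFileMergeSketch {

    // Parses the text of one provider-configuration file into class names,
    // dropping '#' comments and blank lines.
    static List<String> parseServiceFile(String content) {
        List<String> names = new ArrayList<String>();
        for (String line : content.split("\r?\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    // Merges the entries of several files (one per jar), keeping first-seen
    // order and skipping names that were already listed by an earlier file.
    static List<String> mergeServiceFiles(List<String> fileContents) {
        List<String> merged = new ArrayList<String>();
        for (String content : fileContents) {
            for (String name : parseServiceFile(content)) {
                if (!merged.contains(name)) {
                    merged.add(name);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two made-up bundles, each contributing its own service file.
        String imageBundle = "# image bundle\norg.example.ImageParser\n";
        String pdfBundle = "org.example.PdfParser # pdf bundle\n\norg.example.ImageParser\n";
        System.out.println(mergeServiceFiles(Arrays.asList(imageBundle, pdfBundle)));
    }
}
```

In a real classloader the per-jar files would typically be enumerated with ClassLoader.getResources("META-INF/services/org.apache.tika.parser.Parser"), which returns one URL per jar that contains the file.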
[jira] [Updated] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1699: Attachment: TIKA-1699.grobid-core.MattmannShah.081515.patch.txt - here's the patch that Nick backed out in case folks want to use it while we get the Jars published to Central. Integrate the GROBID PDF extractor in Tika -- Key: TIKA-1699 URL: https://issues.apache.org/jira/browse/TIKA-1699 Project: Tika Issue Type: New Feature Components: parser Reporter: Sujen Shah Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 Attachments: TIKA-1699.grobid-core.MattmannShah.081515.patch.txt GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications. It has a java api which can be used to augment PDF parsing for journals and help extract extra metadata about the paper like authors, publication, citations, etc. It would be nice to have this integrated into Tika, I have tried it on my local, will issue a pull request soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698307#comment-14698307 ] Nick Burch commented on TIKA-1706: -- [~thetaphi] We currently have the forbidden apis check defined in the tika-parent pom. I've just tried adding {{{<bundledSignature>commons-io-unsafe-2.4</bundledSignature>}}} there too, but that then causes the build of {{{tika-core}}} to fail, as core doesn't (yet) have commons-io available. Is there a way to make it skip the check if the classes aren't found, but do it if they are? Bring back commons-io to tika-core -- Key: TIKA-1706 URL: https://issues.apache.org/jira/browse/TIKA-1706 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Attachments: TIKA-1706.patch TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space. I believe these arguments are weaker nowadays due to the following concerns: - Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it - Since some modules use both tika-core and commons-io, it's not clear which code should be used - Having the inlined classes causes more maintenance and/or technology debt (which in turn causes more maintenance) - Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on. I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698313#comment-14698313 ] Uwe Schindler commented on TIKA-1706: - Yes, you can add the Maven property {{<failOnUnresolvableSignatures>false</failOnUnresolvableSignatures>}} to the plugin configuration: [http://jenkins.thetaphi.de/job/Forbidden-APIs/javadoc/check-mojo.html#failOnUnresolvableSignatures] An alternative is to enable commons-io-unsafe-2.4 only for those modules where it's used. Unfortunately this is not so easy, because you cannot inherit only some array values in submodules; you must reconfigure all bundledSignatures in each submodule. Bring back commons-io to tika-core -- Key: TIKA-1706 URL: https://issues.apache.org/jira/browse/TIKA-1706 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Attachments: TIKA-1706.patch TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space. I believe these arguments are weaker nowadays due to the following concerns: - Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it - Since some modules use both tika-core and commons-io, it's not clear which code should be used - Having the inlined classes causes more maintenance and/or technology debt (which in turn causes more maintenance) - Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on. I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698317#comment-14698317 ]
Hudson commented on TIKA-1699:
--
FAILURE: Integrated in tika-trunk-jdk1.7 #825 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/825/])
Back out r1695816, so the build can pass again, pending a fix of the broken grobid poms. Fix being tracked in TIKA-1699 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696054)
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf

Integrate the GROBID PDF extractor in Tika
--
Key: TIKA-1699
URL: https://issues.apache.org/jira/browse/TIKA-1699
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Sujen Shah
Assignee: Chris A. Mattmann
Labels: memex
Fix For: 1.11

GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications. It has a java api which can be used to augment PDF parsing for journals and help extract extra metadata about the paper like authors, publication, citations, etc. It would be nice to have this integrated into Tika, I have tried it on my local, will issue a pull request soon.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698315#comment-14698315 ]
Chris A. Mattmann commented on TIKA-1699:
--
bq. I've tried to exclude the grobid transitive dependencies to work around this problem, but even an exclude of * still breaks the build on org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo definition. Unfortunately, I've therefore had to back out your r1695816, in order to unbreak the build. Hopefully we can get the grobid community to sort that shortly, and we can restore it!
Yeah, we're working with them on getting this fixed.
bq. One other possible issue spotted while failing to work around the broken pom - the grobid-core jar seems to be almost 15mb in size! Plus its dependencies themselves. That means we'll increase the size of the tika-app, tika-server and tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could depend on instead, which doesn't cause such a bump in our dependency sizes and jars?
Looking at http://repo1.maven.org/maven2/org/apache/tika/tika-app/1.10/, tika-app seems to be ~48MB, so the size increase is actually closer to 30%.
As for depending on a smaller core jar, I had an idea here. Grobid has a server; I wonder if we should just connect to its REST server? [~sujenshah] That way we could avoid adding really any dependencies beyond CXF and its WebClient. I'll investigate this.
[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika
[ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698340#comment-14698340 ]
Chris A. Mattmann commented on TIKA-1699:
--
I've filed issues to publish all of the grobid-core deps:
Wapiti jar fork: https://issues.sonatype.org/browse/OSSRH-17124
EUGFC ImageIO plugin: https://issues.sonatype.org/browse/OSSRH-17126
Language Detection: https://issues.sonatype.org/browse/OSSRH-17127
Chasen CRFPP: https://issues.sonatype.org/browse/OSSRH-17128
WIPO analysers: https://issues.sonatype.org/browse/OSSRH-17129
That should be all of them. Will let everyone know once they're published.