[jira] [Issue Comment Deleted] (TIKA-1706) Bring back commons-io to tika-core

2015-08-15 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1706:
--
Comment: was deleted

(was: A patch to bring back commons-io to tika-core and replace all formerly 
inlined classes.)

 Bring back commons-io to tika-core
 --

 Key: TIKA-1706
 URL: https://issues.apache.org/jira/browse/TIKA-1706
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


 TIKA-249 inlined select commons-io classes in order to simplify the 
 dependency tree and save some space.
 I believe these arguments are weaker nowadays due to the following concerns:
 - Most of the non-core modules already use commons-io, and since tika-core is 
 usually not used by itself, commons-io is already included with it
 - Since some modules use both tika-core and commons-io, it's not clear which 
 code should be used
 - Having the inlined classes causes more maintenance and/or technical debt 
 (which in turn causes more maintenance)
 - Newer commons-io code utilizes newer platform code, e.g. using Charset 
 objects instead of encoding names, being able to use StringBuilder instead of 
 StringBuffer, and so on.
 I'll be happy to provide a patch to replace usages of the inlined classes 
 with commons-io classes if this is accepted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1699:

Attachment: TIKA-1699.restgrobid.MattmannWIP081515.patch.txt

- here's a WIP patch to convert the Grobid parser to use its REST services. 
Tests are passing. I need to add the rest of the GROBID header XML metadata 
elements. Just got a bit tired :) [~sujenshah] if you want to finish this off, 
all you. Else if you don't beat me to it, maybe I'll finish it tomorrow.

 Integrate the GROBID PDF extractor in Tika
 --

 Key: TIKA-1699
 URL: https://issues.apache.org/jira/browse/TIKA-1699
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Sujen Shah
Assignee: Chris A. Mattmann
  Labels: memex
 Fix For: 1.11

 Attachments: TIKA-1699.grobid-core.MattmannShah.081515.patch.txt, 
 TIKA-1699.restgrobid.MattmannWIP081515.patch.txt


 GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning 
 library for extracting, parsing and re-structuring raw documents such as PDF 
 into structured TEI-encoded documents with a particular focus on technical 
 and scientific publications.
 It has a Java API which can be used to augment PDF parsing for journals and 
 help extract extra metadata about the paper, such as authors, publication, 
 citations, etc. 
 It would be nice to have this integrated into Tika. I have tried it locally 
 and will issue a pull request soon.





[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-15 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698477#comment-14698477
 ] 

Yaniv Kunda commented on TIKA-1706:
---

I've separated out all the related changes, apart from adding commons-io to 
tika-core, and opened them under TIKA-1710.
In addition, the recently added commons-io-unsafe check has now found a couple 
more default-encoding usages:
tika-core:   src\main\java\org\apache\tika\Tika.java
tika-server: src\test\java\org\apache\tika\server\CXFTestBase.java
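
As a rough illustration of why default-encoding usages like these matter, here 
is a sketch only. The check above covers commons-io methods; this example 
substitutes plain JDK calls to stay self-contained, and the class and method 
names are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultEncodingSketch {

    /** Decodes with an explicit charset: the result is the same on every platform. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    /**
     * Decodes with whatever the platform default happens to be.
     * This is the kind of call a default-encoding check is meant to flag,
     * because its result can differ from one machine to the next.
     */
    static String decodeWithPlatformDefault(byte[] bytes) {
        return new String(bytes);
    }
}
```

The two bytes 0xC3 0xA9 decode to "é" as UTF-8 but to "Ã©" as ISO-8859-1, so 
code relying on the default can behave differently depending on where it runs.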




[jira] [Created] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives

2015-08-15 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1710:
-

 Summary: Replace usages of classes in org.apache.tika.io with 
current alternatives
 Key: TIKA-1710
 URL: https://issues.apache.org/jira/browse/TIKA-1710
 Project: Tika
  Issue Type: Improvement
  Components: batch, cli, core, example, gui, parser, server, 
translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


Many of the classes in org.apache.tika.io were inlined from commons-io in 
TIKA-249, but these days most components use commons-io anyway. To clean up 
the dependencies on org.apache.tika.io in preparation for adding commons-io 
to tika-core, the following can be done:
- Replace usages of classes in org.apache.tika.io within non-core components 
with the corresponding classes in commons-io
- Replace usages of org.apache.tika.io.IOUtils.UTF_8 with 
java.nio.charset.StandardCharsets.UTF_8 (in all components, including tika-core)
- Replace other uses of String encoding names of standard charsets with their 
corresponding Charset instances from StandardCharsets (this is logically 
related to IOUtils, since before Java 7 these constants would have belonged 
there alongside UTF_8)
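
The last two replacements can be sketched roughly as follows. This is a 
minimal illustration using only the JDK; the class and method names are 
hypothetical and not taken from the Tika code base:

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

final class CharsetMigrationSketch {

    /** Before: the charset is named by a String, so a checked exception must be handled. */
    static String decodeWithName(byte[] bytes) {
        try {
            return new String(bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM must support UTF-8, but the compiler still requires this handler.
            throw new IllegalStateException(e);
        }
    }

    /** After: the StandardCharsets constant (Java 7+) needs no name lookup and no try/catch. */
    static String decodeWithConstant(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Encoding follows the same pattern. */
    static byte[] encodeWithConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

Both decode methods return the same result; the second simply removes the 
string lookup and the unreachable exception handler.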





[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698161#comment-14698161
 ] 

Nick Burch commented on TIKA-1699:
--

A build from trunk is now failing for me:
{code}
[ERROR] Failed to execute goal on project tika-parsers: Could not resolve 
dependencies for project org.apache.tika:tika-parsers:bundle:1.11-SNAPSHOT: 
Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 - 
org.chasen:crfpp:jar:1.0.2: Failed to read artifact descriptor for 
org.chasen:crfpp:jar:1.0.2: Could not transfer artifact 
org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo 
(file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and 
cannot be created. - [Help 1]
{code}

With -X showing
{code}
Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed 
to collect dependencies at org.grobid:grobid-core:jar:0.3.4 - 
org.chasen:crfpp:jar:1.0.2
Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to 
read artifact descriptor for org.chasen:crfpp:jar:1.0.2
Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not 
transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo 
(file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and 
cannot be created.
{code}

Can we get this broken GROBIN dependency pom fixed / an exclusion in place, so 
that trunk builds again?



[jira] [Reopened] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch reopened TIKA-1699:
--



tika-trunk-jdk1.7 - Build # 824 - Still Failing

2015-08-15 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #824)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/824/ to 
view the results.

[jira] [Comment Edited] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698161#comment-14698161
 ] 

Chris A. Mattmann edited comment on TIKA-1699 at 8/15/15 5:46 PM:
--

A build from trunk is now failing for me:
{code}
[ERROR] Failed to execute goal on project tika-parsers: Could not resolve 
dependencies for project org.apache.tika:tika-parsers:bundle:1.11-SNAPSHOT: 
Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 - 
org.chasen:crfpp:jar:1.0.2: Failed to read artifact descriptor for 
org.chasen:crfpp:jar:1.0.2: Could not transfer artifact 
org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo 
(file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and 
cannot be created. - [Help 1]
{code}

With -X showing
{code}
Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed 
to collect dependencies at org.grobid:grobid-core:jar:0.3.4 - 
org.chasen:crfpp:jar:1.0.2
Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to 
read artifact descriptor for org.chasen:crfpp:jar:1.0.2
Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not 
transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo 
(file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and 
cannot be created.
{code}

Can we get this broken GROBID dependency pom fixed / an exclusion in place, so 
that trunk builds again?


was (Author: gagravarr):
A build from trunk is now failing for me:
{code}
[ERROR] Failed to execute goal on project tika-parsers: Could not resolve 
dependencies for project org.apache.tika:tika-parsers:bundle:1.11-SNAPSHOT: 
Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 - 
org.chasen:crfpp:jar:1.0.2: Failed to read artifact descriptor for 
org.chasen:crfpp:jar:1.0.2: Could not transfer artifact 
org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo 
(file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and 
cannot be created. - [Help 1]
{code}

With -X showing
{code}
Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed 
to collect dependencies at org.grobid:grobid-core:jar:0.3.4 - 
org.chasen:crfpp:jar:1.0.2
Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to 
read artifact descriptor for org.chasen:crfpp:jar:1.0.2
Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not 
transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo 
(file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and 
cannot be created.
{code}

Can we get this broken GROBIN dependency pom fixed / an exclusion in place, so 
that trunk builds again?



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698371#comment-14698371
 ] 

Nick Burch commented on TIKA-1699:
--

{quote}Tika-app is ~48MB it seems so closer to 30% actually size 
increase.{quote}

I added a bit on for the dependency jars that I can't get to!

{quote}As for depending on a smaller core Jar, I had an idea here. Grobid has a 
server, I wonder if we should just connect to its REST server?{quote}

I know that for some of the dependencies so far, we've worked with them to 
produce a -min version or equivalent, with just the key parts in for size 
reasons. My first choice would be for something like that here. 

If not, could we follow the sqlite patterns, bundle the base java code as 
standard, but require people to download the large bulky native platform code 
to fully enable the support? (Assuming I've got the right idea about the bulk 
being from the CRF native stuff?)
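
A minimal sketch of that kind of optional support, assuming the bulky part can 
be detected by probing for a class at runtime. The class name probed in the 
usage note is hypothetical, and this is not necessarily how the sqlite support 
is wired up:

```java
final class OptionalSupportSketch {

    /** Returns true if the named class can be found, without initializing it. */
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, OptionalSupportSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The optional jar (or its native part) has not been installed.
            return false;
        }
    }

    /** Describes what a parser could do depending on whether the optional piece is present. */
    static String describeSupport(String className) {
        return isAvailable(className)
                ? "full support enabled"
                : "base support only; download the optional component to enable the rest";
    }
}
```

For example, describeSupport("org.example.missing.NativeBinding") reports base 
support only, since no such class is on the classpath.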



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698375#comment-14698375
 ] 

Chris A. Mattmann commented on TIKA-1699:
-

To use this patch, follow the instructions first here: 
https://wiki.apache.org/tika/GrobidJournalParser to install Grobid, and then 
apply this patch.



[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-08-15 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698383#comment-14698383
 ] 

Nick Burch commented on TIKA-1707:
--

The build is hopefully working again now. If you could re-test, that'd be 
wonderful!

 Upgrade to Apache POI 3.13 Beta 2
 -

 Key: TIKA-1707
 URL: https://issues.apache.org/jira/browse/TIKA-1707
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.9
Reporter: Andreas Beeker
 Attachments: common_sl.diff


 In the not too distant future, POI 3.13 Beta 2 will be available.
 This contains quite a big change to the PowerPoint modules XSLF/HSLF, but 
 thankfully Tika isn't much affected.
 Please try the patch on our trunk and report any side effects.
 As the work on the common_sl API hasn't been finished yet, there might be 
 another patch for the next POI beta version.





[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698401#comment-14698401
 ] 

Hudson commented on TIKA-1706:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #826 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/826/])
Use a consistent version of Commons IO everywhere, enable the Forbidden APIs 
check for it, and fix problems it found TIKA-1706 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696079)
* /tika/trunk/tika-app/pom.xml
* /tika/trunk/tika-batch/pom.xml
* /tika/trunk/tika-example/pom.xml
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/DirListParser.java
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/MyFirstTika.java
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/RollbackSoftware.java
* /tika/trunk/tika-example/src/test/java/org/apache/tika/example/SimpleTextExtractorTest.java
* /tika/trunk/tika-parent/pom.xml
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-server/pom.xml
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TranslateResource.java




[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-08-15 Thread Andreas Beeker (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698406#comment-14698406
 ] 

Andreas Beeker commented on TIKA-1707:
--

The affected test cases are OK now ... I haven't tried the full-fledged Tika 
test suite, as my JRE chokes on the 2GB heap settings, but tika-parsers seems 
to be OK with 1GB



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698304#comment-14698304
 ] 

Nick Burch commented on TIKA-1699:
--

I've tried to exclude the grobid transitive dependencies to work around this 
problem, but even an exclude of * still breaks the build on 
org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo 
definition. Unfortunately, I've therefore had to back out your r1695816 in 
order to unbreak the build. Hopefully we can get the grobid community to sort 
that out shortly, and we can restore it!

One other possible issue spotted while failing to work around the broken pom: 
the grobid-core jar seems to be almost 15 MB in size, plus its dependencies 
themselves. That means we'll increase the size of the tika-app, tika-server and 
tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could 
depend on instead, which doesn't cause such a bump in our dependency sizes and 
jars?



Re: [DISCUSS] A more modular parser project

2015-08-15 Thread Bob Paulin

Hi,

So just to understand the break downs.  When you say:

tika-classic-parser-bundle/
Tika-office-parser-bundle/ (including microsoft, opendocument, pst, 
rtf, iwork? Has dependency on html/text)
Tika-pdf-parser-bundle/
Tika-text-parser-bundle (including txt,chm, rfc822, html, xml, kml, 
feed, iptc, crypto, etc?)/
Tika-sourcecode-parser-bundle (parsers that handle source code)
Tika-package-parser-bundle (all zip/tar/etc)

Does that indicate 6 bundles?  5 individuals that could wrap into 1 uber 
jar?  Breaking things down at different levels will add to maintenance 
effort so it may be better to start with the broad strokes like 
tika-classic-parser-bundle.  But if we just created a 
tika-classic-parser-bundle are we attempting to group the bundles by a 
type of usecase?  I think this approach is fine but it does mean we're 
taking an opinion on what most of Tika's basic users want for simple 
usecases.


Another approach could be grouping the parsers by similar dependencies 
which I think the tika-multimedia-parser-bundle does fairly well.  From 
a dependency management perspective this is desirable.  I've used tools 
like JDepend to break down which packages use which dependencies.  Also 
determining package based dependencies within tika-parsers can be seen 
here in sonar:


http://nemo.sonarqube.org/design/index/253571


With respect to bundles that don't fit, perhaps those live on their own 
until an obvious grouping emerges.  It's much harder to remove something from a 
bundle than to add it later.  I think this may apply to native bundles too.


- Bob

On 8/4/2015 8:32 AM, Allison, Timothy B. wrote:

Bob,
   Thank you, again.  This looks promising at first glance!

To continue down the strawman path and to start discussion on the elephant in 
the room...

We'd want bundles that allow enough control for users but aren't too much of a 
hassle to configure.  There will be trade-offs.

So, what do we think of this strawman for proposed bundles:

tika-classic-parser-bundle/
Tika-office-parser-bundle/ (including microsoft, opendocument, pst, 
rtf, iwork? Has dependency on html/text)
Tika-pdf-parser-bundle/
 Tika-text-parser-bundle (including txt,chm, rfc822, html, xml, 
kml, feed, iptc, crypto, etc?)/
Tika-sourcecode-parser-bundle (parsers that handle source code)
Tika-package-parser-bundle (all zip/tar/etc)

tika-multimedia-parser-bundle/  (parsers that pull metadata out of image, 
audio, audio+video files)
Tika-image-parser-bundle
Tika-image-ocr-parser-bundle
Tika-audio-parser-bundle
Tika-video-parser-bundle

tika-scientific-parser-bundle/ (all parsers that handle scientific data sets 
(grib, isatab,gdal,hdf,netcdf,geoinfo,dif...much hand-waving...input, Chris?)

tika-nativelib-parser-bundle/ (sqlite...any others at the moment? all parsers 
that rely on native libs...unfortunately, this doesn't fit well thematically...)

tika-advanced-bundle/ (all parsers that rely on nlp or other advanced 
techniques for extraction of information...
these aren't really just pulling text and metadata out, but are 
operating on the text/metadata
 once it has been pulled out.  We may need separate bundles for 
each?)
Tika-nlp-parser-bundle/ (ctakes, phone number, geo.topic, grobid(?) etc.
...or maybe we want separate bundles for each?)
Tika-sentiment-parser-bundle (imaginary...?)
Tika-object-parser-bundle

Where to put?
 font parser
executable
mat
prt
strings


Cheers,
  
Tim




-Original Message-
From: Bob Paulin [mailto:b...@bobpaulin.com]
Sent: Tuesday, August 04, 2015 8:56 AM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] A more modular parser project

So I just tried adding a META-INF/services/org.apache.tika.parser.Parser
file to each bundle in the straw man implementation and it seemed to do
the trick. Looks like the ServiceLoader code searches the classloader
for all of these files and iterates through them to pick up each jar's
META-INF/services/org.apache.tika.parser.Parser entries and adds them to
the list.  I've updated the code on github to include one per bundle.
This might be the way to go.

ex.
https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles/tika-image-parser-bundle/src/main/resources/META-INF/services
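
A minimal sketch of that lookup behaviour, using only the JDK. DemoParser and
the two provider classes are hypothetical stand-ins for
org.apache.tika.parser.Parser and real parsers, and two temporary directories
play the role of two bundle jars, each carrying its own META-INF/services file:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.ServiceLoader;

final class BundleDiscoverySketch {

    /** Hypothetical stand-in for the parser service interface. */
    public interface DemoParser {
        String name();
    }

    public static final class ImageDemoParser implements DemoParser {
        public ImageDemoParser() {}
        @Override public String name() { return "image"; }
    }

    public static final class PdfDemoParser implements DemoParser {
        public PdfDemoParser() {}
        @Override public String name() { return "pdf"; }
    }

    /** Creates a directory that acts like one bundle jar with its own services file. */
    static URL fakeBundle(Class<? extends DemoParser> provider) throws IOException {
        // Temporary directories are not cleaned up in this sketch.
        Path root = Files.createTempDirectory("bundle");
        Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
        // The file is named after the service interface and lists one provider class.
        Files.write(services.resolve(DemoParser.class.getName()),
                provider.getName().getBytes(StandardCharsets.UTF_8));
        return root.toUri().toURL();
    }

    /** Loads every provider listed in any services file visible to the class loader. */
    static List<String> discoverParserNames() {
        try {
            URL[] bundles = {
                fakeBundle(ImageDemoParser.class),
                fakeBundle(PdfDemoParser.class)
            };
            try (URLClassLoader loader =
                    new URLClassLoader(bundles, BundleDiscoverySketch.class.getClassLoader())) {
                List<String> names = new ArrayList<>();
                for (DemoParser parser : ServiceLoader.load(DemoParser.class, loader)) {
                    names.add(parser.name());
                }
                Collections.sort(names);
                return names;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling discoverParserNames() should pick up one entry from each directory,
which is the same aggregation that lets each bundle ship its own services file.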


- Bob

On 8/3/2015 9:21 PM, Allison, Timothy B. wrote:

+1 to moving the source to bundles.  I think for a 2.0 it would be easier
to consolidate into a parser uber jar than trying to tease things out
like I did in the straw man impl. However deciding how to break things
up might take some experimentation.

Y, and the strawman is a great easy entry down this path towards 2.0.  I think 
the main hangup will be coming to consensus about granularity and nature of the 
packages, but we can burn that bridge when we get to it.  There are some 

[jira] [Updated] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1699:

Attachment: TIKA-1699.grobid-core.MattmannShah.081515.patch.txt

- here's the patch that Nick backed out in case folks want to use it while we 
get the Jars published to Central.



[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-15 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698307#comment-14698307
 ] 

Nick Burch commented on TIKA-1706:
--

[~thetaphi] We currently have the forbidden apis check defined in the 
tika-parent pom. I've just tried adding 
{{{<bundledSignature>commons-io-unsafe-2.4</bundledSignature>}}} there too, but 
that then causes the build of {{{tika-core}}} to fail, as core doesn't (yet) 
have commons-io available. Is there a way to make it skip the check if the 
classes aren't found, but do it if they are?



[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698313#comment-14698313
 ] 

Uwe Schindler commented on TIKA-1706:
-

Yes, you can add the Maven property 
{{<failOnUnresolvableSignatures>false</failOnUnresolvableSignatures>}} to the 
plugin configuration: 
[http://jenkins.thetaphi.de/job/Forbidden-APIs/javadoc/check-mojo.html#failOnUnresolvableSignatures]

An alternative is to enable commons-io-unsafe-2.4 only for those modules 
where it's used. Unfortunately this is not so easy: you cannot inherit 
only some array values in submodules, so you must reconfigure all 
bundledSignatures in each submodule.
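
For reference, the first suggestion could look roughly like the fragment 
below in the tika-parent pom. This is only a sketch: the plugin coordinates 
are those of the forbidden-apis Maven plugin, but the omitted version, the 
execution layout and the placement in the pom are assumptions rather than 
the actual Tika build configuration.

```xml
<!-- Sketch only: version omitted, execution layout assumed. -->
<plugin>
  <groupId>de.thetaphi</groupId>
  <artifactId>forbiddenapis</artifactId>
  <configuration>
    <bundledSignatures>
      <!-- any signatures already configured in the parent pom stay here -->
      <bundledSignature>commons-io-unsafe-2.4</bundledSignature>
    </bundledSignatures>
    <!-- Do not fail the build when a signature's classes (here commons-io)
         are missing from a module's classpath, as in tika-core. -->
    <failOnUnresolvableSignatures>false</failOnUnresolvableSignatures>
  </configuration>
  <executions>
    <execution>
      <goals>
        <goal>check</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

With this in the parent pom, modules that do have commons-io on their 
classpath are still checked. The per-module alternative would instead repeat 
the whole {{bundledSignatures}} list in each submodule that needs it.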



[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698317#comment-14698317
 ] 

Hudson commented on TIKA-1699:
--

FAILURE: Integrated in tika-trunk-jdk1.7 #825 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/825/])
Back out r1695816, so the build can pass again, pending a fix of the broken 
grobid poms. Fix being tracked in TIKA-1699 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696054)
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf




[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698315#comment-14698315
 ] 

Chris A. Mattmann commented on TIKA-1699:
-

bq. I've tried to exclude the grobid transitive dependencies to work around this 
problem, but even an exclude of * still breaks the build on 
org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo 
definition. Unfortunately, I've therefore had to back out your r1695816, in 
order to unbreak the build. Hopefully we can get the grobid community to sort 
that shortly, and we can restore it!

Yeah, we're working with them to get this fixed.

bq. One other possible issue spotted while failing to work around the broken 
pom: the grobid-core jar seems to be almost 15 MB in size! Plus its dependencies 
themselves. That means we'll increase the size of the tika-app, tika-server and 
tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could 
depend on instead, which doesn't cause such a bump in our dependency sizes and 
jars?

Looking at: http://repo1.maven.org/maven2/org/apache/tika/tika-app/1.10/

Tika-app is ~48 MB, it seems, so the size increase is actually closer to 30%. 
As for depending on a smaller core jar, I had an idea here: Grobid has a 
server, so I wonder if we should just connect to its REST server? [~sujenshah] 
In that fashion we could avoid adding really any dependencies beyond CXF and 
its WebClient. I'll investigate this.




[jira] [Commented] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

2015-08-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698340#comment-14698340
 ] 

Chris A. Mattmann commented on TIKA-1699:
-

I have filed issues to publish all of the grobid-core deps:
Wapiti jar fork:
https://issues.sonatype.org/browse/OSSRH-17124
EUGFC ImageIO plugin:
https://issues.sonatype.org/browse/OSSRH-17126
Language Detection: 
https://issues.sonatype.org/browse/OSSRH-17127
Chasen CRFPP: 
https://issues.sonatype.org/browse/OSSRH-17128
WIPO analysers: 
https://issues.sonatype.org/browse/OSSRH-17129 

That should be all of them. Will let everyone know once it's published.
