[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available

2021-12-13 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458683#comment-17458683
 ] 

Bob Paulin commented on TIKA-3164:
--

Hey [~tallison].  Saw the mention but will likely not get to this for a few 
days.  I ran a few tests yesterday and was able to recreate your results on my 
machine, but I don't have any specific recommendations yet.

> Upgrade to POI 5.0.0 when available
> ---
>
> Key: TIKA-3164
> URL: https://issues.apache.org/jira/browse/TIKA-3164
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Fix For: 2.1.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3591) The Import-Package of commons.io is wrong in MANIFEST.MF

2021-11-17 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445479#comment-17445479
 ] 

Bob Paulin commented on TIKA-3591:
--

Actually, I take that back: it looks like commons-io is exporting at BOTH 
versions, so it may be OK to keep your requirement.
{code:java}
Export-Package: org.apache.commons.io;version="1.4.",

...
org.apache.commons.io
 ;version="2.10.0" {code}
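
A rough way to see why both requirements can resolve: OSGi matches the version range of an Import-Package against each exported version of the package, so a bundle that exports org.apache.commons.io at both versions satisfies a pre-2.0 range as well as a 2.x range. The sketch below is a simplified stand-in for the framework's matching, not the OSGi API. The class name, the import ranges, and the use of 1.4.999 (the value quoted elsewhere in this thread) are illustrative assumptions, and version qualifiers are ignored.

```java
final class PackageVersionSketch {

    /** Parses "major.minor.micro" into three ints; missing parts default to 0. */
    public static int[] parse(String version) {
        String[] parts = version.trim().split("\\.");
        int[] out = new int[3];
        for (int i = 0; i < 3 && i < parts.length; i++) {
            out[i] = Integer.parseInt(parts[i]);
        }
        return out;
    }

    /** Compares two versions segment by segment. */
    public static int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < 3; i++) {
            if (x[i] != y[i]) {
                return Integer.compare(x[i], y[i]);
            }
        }
        return 0;
    }

    /**
     * Accepts interval ranges such as "[1.4,2)" and bare versions such as
     * "1.4", which OSGi reads as "1.4 or higher".
     */
    public static boolean inRange(String range, String version) {
        String r = range.trim();
        char first = r.charAt(0);
        if (first != '[' && first != '(') {
            return compare(version, r) >= 0;
        }
        char last = r.charAt(r.length() - 1);
        String[] bounds = r.substring(1, r.length() - 1).split(",");
        int low = compare(version, bounds[0]);
        int high = compare(version, bounds[1]);
        boolean lowOk = (first == '[') ? low >= 0 : low > 0;
        boolean highOk = (last == ']') ? high <= 0 : high < 0;
        return lowOk && highOk;
    }
}
```

With these hypothetical ranges, `inRange("[1.4,2)", "1.4.999")` and `inRange("[2.0,3)", "2.10.0")` both hold, which is why an importer asking for either range could be wired to a bundle that exports the package twice.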

> The Import-Package of commons.io is wrong in MANIFEST.MF
> 
>
> Key: TIKA-3591
> URL: https://issues.apache.org/jira/browse/TIKA-3591
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Per Kristian Söreide
>Priority: Major
>
> The tika-bundle-standard bundle gets version 1.4 of commons.io, which leads to 
> NoSuchMethodException in some Parsers/Detectors, since they require later 
> versions of commons.io.
>  





[jira] [Comment Edited] (TIKA-3591) The Import-Package of commons.io is wrong in MANIFEST.MF

2021-11-17 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445474#comment-17445474
 ] 

Bob Paulin edited comment on TIKA-3591 at 11/17/21, 8:29 PM:
-

 
{quote}I agree that's what commons-io is telling tika-core to do, but tika-core 
won't compile with 1.4... for kicks, I switched the commons-io dependency to 
1.4 and it was a disaster. If we revert the bundle import in tika-core to let 
nature take its course, will it all just work?
{quote}
The 1.4 is set at the package level, so it's NOT trying to say that tika can be 
compiled with the 1.4 JAR (as there are other packages).  So, given that the 
commons-io jar is only exporting the org.apache.commons.io package at version 
1.4.999, if you exclude packages that are < 2.0 you may not get tika-core to 
resolve properly by itself in an OSGi runtime.

 


was (Author: bob):
 
{quote}I agree that's what commons-io is telling tika-core to do, but tika-core 
won't compile with 1.4... for kicks, I switched the commons-io dependency to 
1.4 and it was a disaster. If we revert the bundle import in tika-core to let 
nature take its course, will it all just work?
{quote}
The 1.4 is set at the package level so it's trying to say that tika can be 
compiled with 1.4 JAR (as there are other packages).  So given commons-io jar 
is only exporting the org.apache.commons.io package at version 1.4.999 if you 
exclude packages that are < 2.0 you may not get tika-core to resolve properly 
by itself in an osgi runtime.

 



[jira] [Commented] (TIKA-3591) The Import-Package of commons.io is wrong in MANIFEST.MF

2021-11-17 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445474#comment-17445474
 ] 

Bob Paulin commented on TIKA-3591:
--

 
{quote}I agree that's what commons-io is telling tika-core to do, but tika-core 
won't compile with 1.4... for kicks, I switched the commons-io dependency to 
1.4 and it was a disaster. If we revert the bundle import in tika-core to let 
nature take its course, will it all just work?
{quote}
The 1.4 is set at the package level, so it's NOT trying to say that tika can be 
compiled with the 1.4 JAR (as there are other packages).  So, given that the 
commons-io jar is only exporting the org.apache.commons.io package at version 
1.4.999, if you exclude packages that are < 2.0 you may not get tika-core to 
resolve properly by itself in an OSGi runtime.

 



[jira] [Commented] (TIKA-3591) The Import-Package of commons.io is wrong in MANIFEST.MF

2021-11-17 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445471#comment-17445471
 ] 

Bob Paulin commented on TIKA-3591:
--

Hey [~tallison].  The commons-io library is what's saying "hey, you can use 1.4 
if you'd like" [1].  So even though you're saying use 2.11.0, commons-io is 
going to export those packages with the 1.4 version.  So I don't think it's 
wrong that tika-core imports 1.4 in this case, since that's what commons-io is 
telling it to do.  I think the trouble is with the tika-bundle-standard.  If 
the commons-io classes are embedded in the bundle, it should NOT need to import 
them as it is instructed to do [2].  I'd suggest removing that line rather than 
explicitly setting the versions.  Having the jar embedded AND importing those 
packages can lead to unpredictable results depending on the runtime.

[1] [https://github.com/apache/commons-io/blob/884e5ecee572856804a2303e0bb53ab15f1a7543/pom.xml#L320]

[2] [https://github.com/apache/tika/blob/f750410ce5ec206acca156a71de6b32396a78416/tika-bundles/tika-bundle-standard/pom.xml#L222]



[jira] [Commented] (TIKA-3178) Tika 2.0.0 -- Add back OSGi bundles for Tika parsers

2020-12-17 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251496#comment-17251496
 ] 

Bob Paulin commented on TIKA-3178:
--

It looks like the xerces issue is caused by the test harness.  I added a 
classpath exclusion that appears to remove the error.  The other changes look 
like a sure improvement, as the embedding is only in the OSGi bundle now.

> Tika 2.0.0 -- Add back OSGi bundles for Tika parsers
> 
>
> Key: TIKA-3178
> URL: https://issues.apache.org/jira/browse/TIKA-3178
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Blocker
>  Labels: 2.0.0
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3178) Tika 2.0.0 -- Add back OSGi bundles for Tika parsers

2020-12-07 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245495#comment-17245495
 ] 

Bob Paulin commented on TIKA-3178:
--

Hey [~tallison], just found some time to review this.  While the approach works, 
there are some disadvantages to statically including commons-io in these JARs.  
The first and largest is that we're inlining and shipping commons-io with all 
the jar files.  This will increase their size and couple them to commons-io in 
a way that may create classloader issues in non-OSGi environments.  For example, 
if another dependency used with tika requires commons-io, you'll have more than 
one version of the class, and the behavior will be determined by which class 
loads first.  This is not ideal.  At a minimum, we need to make sure we keep 
tika-core clean while we work through this.  I can take a look at it over the 
next few days, but the biggest issue I see is the inlining in tika-core, for 
the reason above.
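
One way to spot the duplicate-class situation on a plain (non-OSGi) classpath is to ask the class loader for every location that provides a given class file. This is only a diagnostic sketch: the class name DuplicateClassSketch and the choice of org.apache.commons.io.IOUtils as the probe are assumptions for illustration, and a copy that was relocated to another package while being inlined would not show up this way.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

final class DuplicateClassSketch {

    /** Returns every location from which the loader can see the class file. */
    public static List<URL> locate(String className, ClassLoader loader)
            throws IOException {
        String resource = className.replace('.', '/') + ".class";
        return Collections.list(loader.getResources(resource));
    }

    public static void main(String[] args) throws IOException {
        List<URL> copies = locate("org.apache.commons.io.IOUtils",
                ClassLoader.getSystemClassLoader());
        // More than one URL means the class ships in more than one jar, and
        // the copy that wins depends on classpath order.
        for (URL copy : copies) {
            System.out.println(copy);
        }
    }
}
```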



[jira] [Resolved] (TIKA-3185) tika-parsers-integration-test fails on windows with File being used by another process.

2020-08-22 Thread Bob Paulin (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-3185.
--
Resolution: Fixed

> tika-parsers-integration-test fails on windows with File being used by 
> another process.
> ---
>
> Key: TIKA-3185
> URL: https://issues.apache.org/jira/browse/TIKA-3185
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>Priority: Minor
> Fix For: 2.0.0
>
>
> The build for tika-parsers-integration-test fails on Windows for both
> TestXXEInXML.testPOIOOXMLs
> TestXXEInXML.testXMLInZips
>  
> The process cannot access the file because it is being used by another 
> process.
>  
> It appears this is because the input zip file is not explicitly closed in the 
> following function.
>  
> XMLTestBase.injectZippedXMLs





[jira] [Updated] (TIKA-3185) tika-parsers-integration-test fails on windows with File being used by another process.

2020-08-22 Thread Bob Paulin (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin updated TIKA-3185:
-
Fix Version/s: 2.0.0
Affects Version/s: 2.0.0



[jira] [Assigned] (TIKA-3185) tika-parsers-integration-test fails on windows with File being used by another process.

2020-08-22 Thread Bob Paulin (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin reassigned TIKA-3185:


Assignee: Bob Paulin

> tika-parsers-integration-test fails on windows with File being used by 
> another process.
> ---
>
> Key: TIKA-3185
> URL: https://issues.apache.org/jira/browse/TIKA-3185
> Project: Tika
>  Issue Type: Bug
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>Priority: Minor
>
> The build for tika-parsers-integration-test fails on Windows for both
> TestXXEInXML.testPOIOOXMLs
> TestXXEInXML.testXMLInZips
>  
> The process cannot access the file because it is being used by another 
> process.
>  
> It appears this is because the input zip file is not explicitly closed in the 
> following function.
>  
> XMLTestBase.injectZippedXMLs





[jira] [Created] (TIKA-3185) tika-parsers-integration-test fails on windows with File being used by another process.

2020-08-22 Thread Bob Paulin (Jira)
Bob Paulin created TIKA-3185:


 Summary: tika-parsers-integration-test fails on windows with File 
being used by another process.
 Key: TIKA-3185
 URL: https://issues.apache.org/jira/browse/TIKA-3185
 Project: Tika
  Issue Type: Bug
Reporter: Bob Paulin


The build for tika-parsers-integration-test fails on Windows for both

TestXXEInXML.testPOIOOXMLs

TestXXEInXML.testXMLInZips

 

The process cannot access the file because it is being used by another process.

 

It appears this is because the input zip file is not explicitly closed in the 
following function.

 

XMLTestBase.injectZippedXMLs
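
The usual remedy for this kind of Windows file lock is to open the input archive in a try-with-resources block so the handle is released even if copying fails. The following is a minimal sketch of that pattern, not the actual XMLTestBase.injectZippedXMLs code; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class ZipCopySketch {

    /** Copies every entry of source into target and returns the entry count. */
    public static int copyEntries(Path source, Path target) throws IOException {
        int copied = 0;
        // All three resources are closed on exit, in reverse order, so the
        // input zip is no longer held open when the method returns.
        try (ZipFile in = new ZipFile(source.toFile());
             OutputStream raw = Files.newOutputStream(target);
             ZipOutputStream out = new ZipOutputStream(raw)) {
            Enumeration<? extends ZipEntry> entries = in.entries();
            byte[] buffer = new byte[8192];
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                out.putNextEntry(new ZipEntry(entry.getName()));
                try (InputStream data = in.getInputStream(entry)) {
                    int read;
                    while ((read = data.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
                out.closeEntry();
                copied++;
            }
        }
        return copied;
    }
}
```

Once copyEntries returns, the source file can be deleted or overwritten, which is the step that fails on Windows when the archive is left open.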





[jira] [Comment Edited] (TIKA-3178) Tika 2.0.0 -- Add back OSGi bundles for Tika parsers

2020-08-21 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182112#comment-17182112
 ] 

Bob Paulin edited comment on TIKA-3178 at 8/21/20, 7:50 PM:


OK, I get past that part of the build now.  Thanks [~tallison].

It's probably worth revisiting separate OSGi convenience bundles per module on 
the dev list.  Having each module built with bundle packaging was 
uncontroversial; however, I recall that having a tika-parser-*-bundle for each 
one was controversial.  I also recall tika-bundle being a bit of an experiment 
on 2.0 as well.


was (Author: bob):
ok I get past that part of the build now.  Thanks [~tallison] .

 

It's probably worth revisiting separate OSGi convince bundles per module on the 
dev list.  Having each module built as a bundle packaging was uncontroversial 
however I recall having a tika-parser-*-bundle for each one was.  I also recall 
tika-bundle being a bit of a experiment as well on 2.0.



[jira] [Commented] (TIKA-3178) Tika 2.0.0 -- Add back OSGi bundles for Tika parsers

2020-08-21 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182112#comment-17182112
 ] 

Bob Paulin commented on TIKA-3178:
--

OK, I get past that part of the build now.  Thanks [~tallison].

It's probably worth revisiting separate OSGi convenience bundles per module on 
the dev list.  Having each module built with bundle packaging was 
uncontroversial; however, I recall that having a tika-parser-*-bundle for each 
one was controversial.  I also recall tika-bundle being a bit of an experiment 
on 2.0 as well.



[jira] [Commented] (TIKA-3178) Tika 2.0.0 -- Add back OSGi bundles for Tika parsers

2020-08-21 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182090#comment-17182090
 ] 

Bob Paulin commented on TIKA-3178:
--

Also I'm getting the following when building.  Seems like the 
maven-bundle-plugin needs to be there to use the bundle packaging.

 
{code:java}
 @ 
[ERROR] The build could not read 1 project -> [Help 1]
[ERROR]   
[ERROR]   The project org.apache.tika:tika-parsers:2.0.0-SNAPSHOT 
(C:\Users\bpaulin\git\tika-2.0\tika-parsers\pom.xml) has 1 error
[ERROR] Unknown packaging: bundle @ line 33, column 14
[ERROR]  {code}



[jira] [Commented] (TIKA-3178) Tika 2.0.0 -- Add back OSGi bundles for Tika parsers

2020-08-21 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182089#comment-17182089
 ] 

Bob Paulin commented on TIKA-3178:
--

Is this for recreating tika-bundle?  Or are we looking to create individual 
bundles for each module?



[jira] [Comment Edited] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-07 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102137#comment-17102137
 ] 

Bob Paulin edited comment on TIKA-3094 at 5/8/20, 1:02 AM:
---

Looks like the jaxb error is not so much an issue with tika as it is with the 
test OSGi container.  There are a few different ways to address the jars removed 
in Java 11, but the simplest, I think, is to just add the missing jars to the 
classpath and expose them to the bundle from the system packages.  I do not see 
the error on Java 11 or 8 now.
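
As a sketch of the "expose them from the system packages" idea: OSGi frameworks read the standard launch property org.osgi.framework.system.packages.extra to export additional packages from the framework's own class path. The helper below only builds that property value. The class name, the package list, and the 2.3 version are assumptions for illustration, and this is not necessarily how the Tika test container was actually changed.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class SystemPackagesSketch {

    /** Standard OSGi launch property for extra system-bundle exports. */
    public static final String EXTRA_KEY =
            "org.osgi.framework.system.packages.extra";

    /** Builds a framework configuration map exporting the given packages. */
    public static Map<String, String> frameworkConfig(List<String> packages,
                                                      String version) {
        String value = packages.stream()
                .map(p -> p + ";version=\"" + version + "\"")
                .collect(Collectors.joining(","));
        Map<String, String> config = new LinkedHashMap<>();
        config.put(EXTRA_KEY, value);
        return config;
    }
}
```

A map like this would be passed to whatever launches the framework, with the JAXB jars placed on that launcher's classpath so the exported packages actually resolve to classes.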


was (Author: bob):
Looks like the jaxb error is not so much an issue with tika as it is with the 
test OSGi container.  There's a few different ways to address the jars removed 
in Java 11 but the most simple I think is to just add the missing jars to the 
classpath and expose them to the bundle from the system class loader.  I do not 
see the error on java 11 or 8 now.

> Apache Tika fails to extract text for pptx extension.
> -
>
> Key: TIKA-3094
> URL: https://issues.apache.org/jira/browse/TIKA-3094
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.24, 1.24.1
>Reporter: Abhishek Chauhan
>Assignee: Bob Paulin
>Priority: Critical
> Attachments: Sample PPT.pptx
>
>
> This is a regression from version 1.23 of Apache Tika. Text extraction for the 
> .pptx extension, which worked with Apache Tika 1.23, no longer works in 
> version 1.24.
> For the .ppt extension it works fine in both 1.23 and 1.24.
>  
> According to the release notes [https://tika.apache.org/1.24/index.html], you 
> have updated POI to 4.1.2. That might be the root cause of this problem. 
> POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2] 
> which is not present in the bundle, I guess.
>  
>  





[jira] [Comment Edited] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-07 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102137#comment-17102137
 ] 

Bob Paulin edited comment on TIKA-3094 at 5/8/20, 1:02 AM:
---

Looks like the jaxb error is not so much an issue with tika as it is with the 
test OSGi container.  There are a few different ways to address the jars removed 
in Java 11, but the simplest, I think, is to just add the missing jars to the 
classpath and expose them to the bundle from the system class loader.  I do not 
see the error on Java 11 or 8 now.


was (Author: bob):
Looks like the jaxb error is not so much an issue with tika as it is with the 
test OSGi container.  There's a few different ways to address the jars removed 
in Java 11 but the most simple I think is to just add the missing jars to the 
class loader and expose them to the bundle from the system class loader.  I do 
not see the error on java 11 or 8 now.



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-07 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102137#comment-17102137
 ] 

Bob Paulin commented on TIKA-3094:
--

Looks like the jaxb error is not so much an issue with tika as it is with the 
test OSGi container.  There are a few different ways to address the jars removed 
in Java 11, but the simplest, I think, is to just add the missing jars to the 
class loader and expose them to the bundle from the system class loader.  I do 
not see the error on Java 11 or 8 now.



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-05 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099868#comment-17099868
 ] 

Bob Paulin commented on TIKA-3094:
--

Thanks [~tallison].  For #2, JAXB was removed from the JDK distribution in Java 
11, so I'm not surprised there.  We should be able to correct that by bringing 
in the dependencies.  A little surprised you hit it on Java 8, since it's still 
in the JDK there, but OSGi might be hiding it.  Will take a look.
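
A quick probe can confirm whether a class such as javax.xml.bind.JAXBContext is visible to a particular class loader, which helps separate "missing from the JDK" from "hidden by the container". This is a generic sketch; the class name JaxbProbeSketch is hypothetical, and inside OSGi the loader to test would be the bundle's own, which this example does not obtain.

```java
final class JaxbProbeSketch {

    /** Returns true if the named class can be found through the loader. */
    public static boolean isVisible(String className, ClassLoader loader) {
        try {
            // initialize=false: only check visibility, run no static init.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = ClassLoader.getSystemClassLoader();
        System.out.println("JAXB visible: "
                + isVisible("javax.xml.bind.JAXBContext", loader));
    }
}
```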



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-05 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099848#comment-17099848
 ] 

Bob Paulin commented on TIKA-3094:
--

Hey [~tallison], I ran a build on Java 8 and Java 11 and I was unable to 
recreate #2.  Can you provide more details?  Perhaps the output of the stack 
trace you're getting?

For #3, I do get errors running the build, but I'm not sure which are expected 
and which are the ones with the wrong metadata.  Can you provide an example?

Also, it might be helpful to separate these out into different JIRAs.  This one 
is snowballing a bit.



[jira] [Resolved] (TIKA-3095) tika-bundle tests fail on windows due to missing jcip-annotations

2020-04-29 Thread Bob Paulin (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-3095.
--
Fix Version/s: 1.25
   Resolution: Fixed

> tika-bundle tests fail on windows due to missing jcip-annotations
> -
>
> Key: TIKA-3095
> URL: https://issues.apache.org/jira/browse/TIKA-3095
> Project: Tika
>  Issue Type: Bug
> Environment: Windows
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>Priority: Minor
> Fix For: 1.25
>
>
> The tika-bundle build fails with tests on Windows due to a missing 
> jcip-annotations dependency that is required by grib.  This can be added 
> explicitly to the tika-parsers project to prevent this from occurring.





[jira] [Created] (TIKA-3095) tika-bundle tests fail on windows due to missing jcip-annotations

2020-04-29 Thread Bob Paulin (Jira)
Bob Paulin created TIKA-3095:


 Summary: tika-bundle tests fail on windows due to missing 
jcip-annotations
 Key: TIKA-3095
 URL: https://issues.apache.org/jira/browse/TIKA-3095
 Project: Tika
  Issue Type: Bug
 Environment: Windows
Reporter: Bob Paulin
Assignee: Bob Paulin


The tika-bundle build fails with tests on Windows due to a missing 
jcip-annotations dependency that is required by grib.  This can be added 
explicitly to the tika-parsers project to prevent this from occurring.





[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-29 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095917#comment-17095917
 ] 

Bob Paulin commented on TIKA-3094:
--

Fixed with 
https://github.com/apache/tika/commit/6789674dd273fbd07350d8a7dfc193e1da34aeb8

> Apache Tika fails to extract text for pptx extension.
> -
>
> Key: TIKA-3094
> URL: https://issues.apache.org/jira/browse/TIKA-3094
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.24
>Reporter: Abhishek Chauhan
>Assignee: Bob Paulin
>Priority: Major
> Attachments: Sample PPT.pptx
>
>
> This is a regression from version 1.23 of Apache Tika. Text extraction for the 
> .pptx extension, which worked with Apache Tika 1.23, no longer works in 
> version 1.24.
> For the .ppt extension it works fine in both 1.23 and 1.24.
>  
> According to the release notes [https://tika.apache.org/1.24/index.html], you 
> have updated POI to 4.1.2. That might be the root cause of this problem. 
> POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2] 
> which is not present in the bundle, I guess.
>  
>  





[jira] [Assigned] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-29 Thread Bob Paulin (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin reassigned TIKA-3094:


Assignee: Bob Paulin



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-29 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095883#comment-17095883
 ] 

Bob Paulin commented on TIKA-3094:
--

Embedding SparseBitSet in Embed-Dependency fixes the issue.  Will be submitting 
a fix to branch_1x this evening.

> Apache Tika fails to extract text for pptx extension.
> -
>
> Key: TIKA-3094
> URL: https://issues.apache.org/jira/browse/TIKA-3094
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.24
>Reporter: Abhishek Chauhan
>Priority: Major
> Attachments: Sample PPT.pptx
>
>
> This is a regression from version 1.23 of Apache Tika. Text extraction for the 
> .pptx extension, which worked with Apache Tika 1.23, no longer works in 
> version 1.24.
> For the .ppt extension it works fine in both 1.23 and 1.24.
>  
> According to the release notes [https://tika.apache.org/1.24/index.html], you 
> have updated POI to 4.1.2. That might be the root cause of this problem. 
> POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2] 
> which is not present in the bundle, I guess.
>  
>  





[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094614#comment-17094614
 ] 

Bob Paulin commented on TIKA-3094:
--

Thanks [~abchauha].  The build process adds OSGi-specific headers, so I'm not 
surprised the approach you describe above was unsuccessful.  The 
maven-bundle-plugin will ensure it is embedded properly.



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094517#comment-17094517
 ] 

Bob Paulin commented on TIKA-3094:
--

If SparseBitSet is embedded in the tika-bundle then the library itself doesn't 
need the OSGi headers, as tika-bundle will provide them.  It looks like 
it's Apache Licensed so we should be able to do this.  Can you provide a .pptx 
that is failing?  It would be great if we could integrate it into the test suite.



[jira] [Commented] (TIKA-2987) Extracting Metadata from JPEG Fails with Tika Bundle

2019-11-19 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977564#comment-16977564
 ] 

Bob Paulin commented on TIKA-2987:
--

I happened to have the source of your project on my PC from an evaluation I was 
doing last month.  This is a cool addition to the Sling project.  I see 2 
issues when I run the build off of master.  The first is that the tika-bundle 
appears to be at 1.21 while tika-core and tika-parsers are at 1.22.  These 
versions should match, so upgrade your tika-bundle to 1.22.  The second is that 
tika-parsers is included alongside tika-bundle.  The tika-bundle embeds 
tika-parsers and exports its packages, so you'll have 2 bundles exporting the 
same packages.  The error occurs because, when tika-core is searching for 
imports, it might find tika-parsers instead of tika-bundle.  Since tika-parsers 
does not have the dependencies embedded, and tika-bundle does not export the 
dependency packages, the class is not found.  This also explains why your 
approach of bundling the metadata-extractor did not work.  The tika-parsers 
include is probably a holdover from Sling.  It might be a good idea to see if 
we can get Sling to use the bundle, or you could look at excluding tika-parsers 
from your project.  When I remove tika-parsers from the OSGi runtime the class 
is found successfully:

 
{code:java}
19.11.2019 09:19:27.684 *INFO* [oak-executor-68] org.apache.sling.cms.core.internal.FileMetadataExtractorImpl Extracting metadata from /static/DJI_0032.JPG
19.11.2019 09:19:28.429 *INFO* [oak-executor-68] org.apache.sling.cms.core.internal.FileMetadataExtractorImpl Metadata extracted from /static/DJI_0032.JPG
19.11.2019 09:19:28.432 *INFO* [oak-executor-64] org.apache.sling.cms.core.internal.FileMetadataExtractorImpl Extracting metadata from /static/DJI_0032.JPG
19.11.2019 09:19:28.518 *INFO* [oak-executor-64] org.apache.sling.cms.core.internal.FileMetadataExtractorImpl Metadata extracted from /static/DJI_0032.JPG
19.11.2019 09:19:33.582 *INFO* [oak-executor-68] org.apache.sling.cms.core.internal.FileMetadataExtractorImpl Extracting metadata from /static/DJI_0032.JPG
19.11.2019 09:19:33.663 *INFO* [oak-executor-68] org.apache.sling.cms.core.internal.FileMetadataExtractorImpl Metadata extracted from /static/DJI_0032.JPG {code}
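
To spot the "two bundles exporting the same packages" situation before 
deploying, something like the following Java sketch could help.  It takes the 
Export-Package header text of each bundle and reports packages exported by more 
than one of them.  The parser is deliberately simplified (it only honours 
double quotes), and the class name and sample header values are hypothetical, 
not copied from the real tika-bundle or tika-parsers manifests.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class DuplicateExports {

    /** Splits on a separator, ignoring separators inside double quotes. */
    static List<String> split(String text, char separator) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean quoted = false;
        for (char c : text.toCharArray()) {
            if (c == '"') {
                quoted = !quoted;
            }
            if (c == separator && !quoted) {
                parts.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString().trim());
        return parts;
    }

    /** Extracts package names from an Export-Package header value. */
    static Set<String> exportedPackages(String header) {
        Set<String> packages = new LinkedHashSet<>();
        for (String clause : split(header, ',')) {
            for (String part : split(clause, ';')) {
                // Attributes (version="...") and directives (uses:="...") contain '='.
                if (!part.isEmpty() && !part.contains("=")) {
                    packages.add(part);
                }
            }
        }
        return packages;
    }

    /** Maps each package exported by two or more bundles to the bundles exporting it. */
    static Map<String, List<String>> duplicates(Map<String, String> exportHeaderByBundle) {
        Map<String, List<String>> owners = new TreeMap<>();
        exportHeaderByBundle.forEach((bundle, header) -> {
            for (String pkg : exportedPackages(header)) {
                owners.computeIfAbsent(pkg, key -> new ArrayList<>()).add(bundle);
            }
        });
        owners.values().removeIf(bundles -> bundles.size() < 2);
        return owners;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        // Hypothetical header values, for illustration only.
        headers.put("tika-bundle",
                "org.apache.tika.parser.jpeg;version=\"1.22.0\",org.apache.tika.parser.pdf;version=\"1.22.0\"");
        headers.put("tika-parsers", "org.apache.tika.parser.jpeg;version=\"1.22.0\"");
        System.out.println(duplicates(headers));
    }
}
```

Any package reported here is one the framework could wire to either bundle, 
which is the ambiguity described above.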

> Extracting Metadata from JPEG Fails with Tika Bundle
> 
>
> Key: TIKA-2987
> URL: https://issues.apache.org/jira/browse/TIKA-2987
> Project: Tika
>  Issue Type: New Feature
>Affects Versions: 1.21
>Reporter: Dan Klco
>Priority: Major
>
> When attempting to extract metadata from a JPEG image with Tika OSGi Bundle, 
> it fails with the following exception:
> {code:java}
> 18.11.2019 14:57:26.000 *ERROR* [oak-executor-37] org.apache.jackrabbit.oak.plugins.observation.FilteringDispatcher Uncaught exception in org.apache.jackrabbit.oak.plugins.observation.FilteringDispatcher@139d97c6
> 18.11.2019 14:57:26.000 *ERROR* [oak-executor-37] org.apache.jackrabbit.oak.plugins.observation.FilteringDispatcher Uncaught exception in org.apache.jackrabbit.oak.plugins.observation.filteringdispatc...@139d97c6
> java.lang.NoClassDefFoundError: com/drew/imaging/jpeg/JpegProcessingException
>     at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:58)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) [org.apache.tika.core:1.22.0]
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) [org.apache.tika.core:1.22.0]
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) [org.apache.tika.core:1.22.0]
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) [org.apache.tika.core:1.22.0]
>     at org.apache.sling.cms.core.internal.FileMetadataExtractorImpl.extractMetadata(FileMetadataExtractorImpl.java:148) [org.apache.sling.cms.core:0.12.1.SNAPSHOT]
>     at org.apache.sling.cms.core.internal.FileMetadataExtractorImpl.updateMetadata(FileMetadataExtractorImpl.java:123) [org.apache.sling.cms.core:0.12.1.SNAPSHOT]
>     at org.apache.sling.cms.core.internal.listeners.FileMetadataExtractorListener.handleChange(FileMetadataExtractorListener.java:59) [org.apache.sling.cms.core:0.12.1.SNAPSHOT]
>     at org.apache.sling.cms.core.internal.listeners.FileMetadataExtractorListener.onChange(FileMetadataExtractorListener.java:74) [org.apache.sling.cms.core:0.12.1.SNAPSHOT]
>     at org.apache.sling.resourceresolver.impl.observation.BasicObservationReporter.reportChanges(BasicObservationReporter.java:211) [org.apache.sling.resourceresolver:1.6.14]
>     at org.apache.sling.jcr.resource.internal.JcrResourceListener.onEvent(JcrResourceListener.java:155) [org.apache.sling.jcr.resource:3.0.18]
>     at
> 

[jira] [Commented] (TIKA-2941) OSGI bundle and app are not self-contained

2019-10-08 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947341#comment-16947341
 ] 

Bob Paulin commented on TIKA-2941:
--

Looks like there's a configuration option that does exactly what we need.

Added
{code:java}
true {code}
to the maven-bundle-plugin and the embedded dependencies are removed from the 
pom.  Added to branch_1x and master.

 

> OSGI bundle and app are not self-contained
> --
>
> Key: TIKA-2941
> URL: https://issues.apache.org/jira/browse/TIKA-2941
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.22
>Reporter: Peng Cheng
>Priority: Major
>
> Tika bundle still have dependencies spilled out of its package and cause jar 
> hell everywhere. If tika bundle is declared in maven as a dependency, a maven 
> dependency:tree will indicate:
> [INFO] | +- org.apache.tika:tika-bundle:jar:1.22:test
>  [INFO] | | +- org.apache.tika:tika-core:jar:1.22:test
>  [INFO] | | - org.apache.tika:tika-parsers:jar:1.22:test
>  [INFO] | | +- org.glassfish.jaxb:jaxb-runtime:jar:2.3.2:test
>  [INFO] | | | +- jakarta.xml.bind:jakarta.xml.bind-api:jar:2.3.2:test
>  [INFO] | | | +- org.glassfish.jaxb:txw2:jar:2.3.2:test
>  [INFO] | | | +- com.sun.istack:istack-commons-runtime:jar:3.0.8:test
>  [INFO] | | | +- org.jvnet.staxex:stax-ex:jar:1.8.1:test
>  [INFO] | | | - com.sun.xml.fastinfoset:FastInfoset:jar:1.2.16:test
>  [INFO] | | +- com.sun.activation:jakarta.activation:jar:1.2.1:test
>  [INFO] | | +- org.gagravarr:vorbis-java-tika:jar:0.8:test
>  [INFO] | | +- org.tallison:jmatio:jar:1.5:test
>  [INFO] | | +- org.apache.james:apache-mime4j-core:jar:0.8.3:test
>  [INFO] | | +- org.apache.james:apache-mime4j-dom:jar:0.8.3:test
>  [INFO] | | +- com.epam:parso:jar:2.0.11:test
>  [INFO] | | +- org.brotli:dec:jar:0.1.2:test
>  [INFO] | | +- org.apache.pdfbox:pdfbox:jar:2.0.16:test
>  [INFO] | | | - org.apache.pdfbox:fontbox:jar:2.0.16:test
>  [INFO] | | +- org.apache.pdfbox:pdfbox-tools:jar:2.0.16:test
>  [INFO] | | +- org.apache.pdfbox:jempbox:jar:1.8.16:test
>  [INFO] | | +- org.bouncycastle:bcmail-jdk15on:jar:1.62:test
>  [INFO] | | | - org.bouncycastle:bcpkix-jdk15on:jar:1.62:test
>  [INFO] | | +- org.bouncycastle:bcprov-jdk15on:jar:1.62:test
>  [INFO] | | +- org.apache.poi:poi:jar:4.0.1:test
>  [INFO] | | | - org.apache.commons:commons-collections4:jar:4.2:test
>  [INFO] | | +- org.apache.poi:poi-scratchpad:jar:4.0.1:test
>  [INFO] | | +- org.apache.poi:poi-ooxml:jar:4.0.1:test
>  [INFO] | | | +- org.apache.poi:poi-ooxml-schemas:jar:4.0.1:test
>  [INFO] | | | | - org.apache.xmlbeans:xmlbeans:jar:3.0.2:test
>  [INFO] | | | - com.github.virtuald:curvesapi:jar:1.05:test
>  [INFO] | | +- com.healthmarketscience.jackcess:jackcess:jar:3.0.1:test
>  [INFO] | | +- 
> com.healthmarketscience.jackcess:jackcess-encrypt:jar:3.0.0:test
>  [INFO] | | +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:test
>  [INFO] | | +- org.ow2.asm:asm:jar:7.2-beta:test
>  [INFO] | | +- com.googlecode.mp4parser:isoparser:jar:1.1.22:test
>  [INFO] | | +- com.drewnoakes:metadata-extractor:jar:2.11.0:test
>  [INFO] | | | - com.adobe.xmp:xmpcore:jar:5.1.3:test
>  [INFO] | | +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:test
>  [INFO] | | +- com.rometools:rome:jar:1.12.1:test
>  [INFO] | | | - com.rometools:rome-utils:jar:1.12.1:test
>  [INFO] | | +- org.gagravarr:vorbis-java-core:jar:0.8:test
>  [INFO] | | +- org.codelibs:jhighlight:jar:1.0.3:test
>  [INFO] | | +- com.pff:java-libpst:jar:0.8.1:test
>  [INFO] | | +- com.github.junrar:junrar:jar:4.0.0:test
>  [INFO] | | +- org.apache.cxf:cxf-rt-rs-client:jar:3.3.2:test
>  [INFO] | | | +- org.apache.cxf:cxf-rt-transports-http:jar:3.3.2:test
>  [INFO] | | | +- org.apache.cxf:cxf-core:jar:3.3.2:test
>  [INFO] | | | | +- com.fasterxml.woodstox:woodstox-core:jar:5.0.3:test
>  [INFO] | | | | | - org.codehaus.woodstox:stax2-api:jar:3.1.4:test
>  [INFO] | | | | +- org.apache.ws.xmlschema:xmlschema-core:jar:2.2.4:test
>  [INFO] | | | | - org.glassfish.jaxb:jaxb-xjc:jar:2.3.2:test
>  [INFO] | | | | +- org.glassfish.jaxb:xsom:jar:2.3.2:test
>  [INFO] | | | | +- org.glassfish.jaxb:codemodel:jar:2.3.2:test
>  [INFO] | | | | +- com.sun.xml.bind.external:rngom:jar:2.3.2:test
>  [INFO] | | | | +- com.sun.xml.dtd-parser:dtd-parser:jar:1.4.1:test
>  [INFO] | | | | +- com.sun.istack:istack-commons-tools:jar:3.0.8:test
>  [INFO] | | | | - com.sun.xml.bind.external:relaxng-datatype:jar:2.3.2:test
>  [INFO] | | | - org.apache.cxf:cxf-rt-frontend-jaxrs:jar:3.3.2:test
>  [INFO] | | | +- jakarta.ws.rs:jakarta.ws.rs-api:jar:2.1.5:test
>  [INFO] | | | - org.apache.cxf:cxf-rt-security:jar:3.3.2:test
>  [INFO] | | +- org.apache.opennlp:opennlp-tools:jar:1.9.1:test
>  [INFO] | | +- com.googlecode.json-simple:json-simple:jar:1.1.1:test
>  [INFO] | | +- com.github.openjson:openjson:jar:1.0.11:test
>  [INFO] | | +- 

[jira] [Commented] (TIKA-2941) OSGI bundle and app are not self-contained

2019-10-07 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945889#comment-16945889
 ] 

Bob Paulin commented on TIKA-2941:
--

Just an update to provide some transparency around why we got here.  With the 
newer version of the maven-bundle-plugin, when I revert my earlier commit I do 
not see the transitive dependencies included if tika-parsers is in provided 
scope.  With tika-parsers being embedded, it does not really make sense for it 
to be in provided scope anyway.  However, with tika-parsers as a compile-time 
dependency, all the transitive dependencies are included in Maven, which is 
what is being called out as the issue in this JIRA.  The good thing is that 
from an OSGi perspective we're still OK, since only the following packages are 
exported:

 
{code:java}

  !org.apache.tika.parser,
  !org.apache.tika.parser.external,
  org.apache.tika.parser.*,
  org.apache.tika.metadata.serialization.*,
 {code}
 

But the Maven side still shows all the transitive dependencies coming through.  
So in an OSGi runtime all these packages are private as expected, but in the 
development environment this is a bit confusing since Maven shows them coming 
through.  I will need some time to see if we can get the Maven side of this 
equation right without breaking the OSGi side.  Hopefully this helps provide 
some context around the problem we're solving.

 


[jira] [Commented] (TIKA-2941) OSGI bundle and app are not self-contained

2019-10-04 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944648#comment-16944648
 ] 

Bob Paulin commented on TIKA-2941:
--

Yeah, I can take a look.  I reviewed some of the lists and it doesn't appear I 
was transparent about why I made the change in that way.  It could have had 
something to do with the plugin upgrade, but I'd need to play around with it a 
bit to get my head wrapped around it again.


[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-08-16 Thread Bob Paulin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909126#comment-16909126
 ] 

Bob Paulin commented on TIKA-2882:
--

Happy to help review or answer questions on it.  Seems like we're getting 
critical mass on getting this change to happen!

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>Assignee: Sergey Beryozkin
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-05-30 Thread Bob Paulin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851811#comment-16851811
 ] 

Bob Paulin commented on TIKA-2882:
--

I think it might be a good idea to bring these ideas to the list and put 
together a new 2.0 proposal with a small scope.  I like the idea of making the 
release just about modules.  We can always do a 3.0 for the other items.  Also, 
I'm fine with delaying the OSGi aspects.  I'm working on some ideas that 
would eliminate the need for separate projects, but I need to bake them some 
more.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2719) Java 9: Requiring tika-parsers from module-info.java fails with "module not found"

2018-08-30 Thread Bob Paulin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597945#comment-16597945
 ] 

Bob Paulin commented on TIKA-2719:
--

{{org.apache.tika.core}} sounds more specific and better to me.

> Java 9: Requiring tika-parsers from module-info.java fails with "module not 
> found"
> --
>
> Key: TIKA-2719
> URL: https://issues.apache.org/jira/browse/TIKA-2719
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.18
>Reporter: James Baker
>Priority: Major
>  Labels: java9, module
>
> When requiring `tika.parsers` from a Java 9 `module-info.java`, Maven throws 
> an error about not being able to find `tika.parsers`:
>  
> {{[ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.7.0:compile 
> (default-compile) on project annot8-components-tika: Compilation failure}}
> {{[ERROR] 
> /home/bakerj/annot8/annot8-components/annot8-components-tika/src/main/java/module-info.java:[9,16]
>  module not found: tika.parsers}}
>  
> It looks like this is likely to be a similar issue to: 
> [https://github.com/elastic/elasticsearch/issues/28984]
> For an example of a failing project, see: 
> [https://github.com/annot8/annot8-components/tree/tika]





[jira] [Commented] (TIKA-2719) Java 9: Requiring tika-parsers from module-info.java fails with "module not found"

2018-08-30 Thread Bob Paulin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597897#comment-16597897
 ] 

Bob Paulin commented on TIKA-2719:
--

Yeah, for some reason I thought we already released tika.core.  A fully 
qualified module name for each project seems like a good idea.



[jira] [Commented] (TIKA-2719) Java 9: Requiring tika-parsers from module-info.java fails with "module not found"

2018-08-30 Thread Bob Paulin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597674#comment-16597674
 ] 

Bob Paulin commented on TIKA-2719:
--

I would suggest adding {{Automatic-Module-Name: tika.parsers}} to the 
tika-parsers MANIFEST.MF.  This will at least allow us to explicitly name the 
module until we can properly create a module-info.java for the parsers.  See 
[http://branchandbound.net/blog/java/2017/12/automatic-module-name/]
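
To illustrate what the header changes, here is a small Java 9+ sketch that asks 
the JDK's ModuleFinder which module name it assigns to a jar, with and without 
Automatic-Module-Name.  The jars are empty stand-ins written to a temp 
directory, the class name is made up, and the fully qualified value 
org.apache.tika.parsers is only an assumed example, not a released module name.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.lang.module.ModuleFinder;
import java.lang.module.ModuleReference;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

final class AutomaticModuleNameCheck {

    /** Writes a manifest-only jar; pass null as moduleName to omit the header. */
    static Path writeJar(String fileName, String moduleName) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        if (moduleName != null) {
            main.putValue("Automatic-Module-Name", moduleName);
        }
        try {
            Path jar = Files.createTempDirectory("module-name").resolve(fileName);
            try (OutputStream out = Files.newOutputStream(jar);
                 JarOutputStream jarOut = new JarOutputStream(out, manifest)) {
                // No class entries are needed to observe the module name.
            }
            return jar;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Returns the module name the module system assigns to the given jar. */
    static String moduleNameOf(Path jar) {
        ModuleReference reference = ModuleFinder.of(jar).findAll().iterator().next();
        return reference.descriptor().name();
    }

    public static void main(String[] args) {
        // Without the header the name is derived from the file name.
        System.out.println(moduleNameOf(writeJar("tika-parsers-1.18.jar", null)));
        // With the header the declared name is used instead.
        System.out.println(moduleNameOf(writeJar("tika-parsers-1.18.jar", "org.apache.tika.parsers")));
    }
}
```

The point of the header is that the name no longer depends on how the jar file 
happens to be called on disk.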

 



[jira] [Resolved] (TIKA-2710) Set Tika to OSGi Execution Environment JavaSE-1.8

2018-08-17 Thread Bob Paulin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-2710.
--
Resolution: Fixed

> Set Tika to OSGi Execution Environment JavaSE-1.8
> -
>
> Key: TIKA-2710
> URL: https://issues.apache.org/jira/browse/TIKA-2710
> Project: Tika
>  Issue Type: Improvement
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>Priority: Major
>
> In TIKA-2692 the OSGi execution environment (osgi.ee) was removed since it 
> was coming out as 9.0 when 1.8 was expected.  This is due to the 
> org.ow2.asm:asm:6.2 JAR file having a module-info.class within it.  Turns out 
> this causes the maven-bundle-plugin to increase the execution environment to 
> Java 9.  Rather than removing osgi.ee we should manually set it to 1.8 which 
> is correct since Java 8 will simply ignore the module-info.class.  This can 
> be reverted once Tika moves up to Java 9 or beyond.





[jira] [Commented] (TIKA-2692) Blanket upgrades in prep for 1.19

2018-08-17 Thread Bob Paulin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584249#comment-16584249
 ] 

Bob Paulin commented on TIKA-2692:
--

[~talli...@apache.org] I created TIKA-2710 for the improved solution.  
However, it is fair to say that the change you made does not do any harm.  It 
just means the OSGi environment won't enforce the version of Java on the tika 
bundle.

> Blanket upgrades in prep for 1.19
> -
>
> Key: TIKA-2692
> URL: https://issues.apache.org/jira/browse/TIKA-2692
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.19, 2.0.0
>
>
> On the dev list [~solomax] recommended using: 
> https://sonatype.github.io/ossindex-maven/maven-plugin/ to identify 
> vulnerable dependencies.
> Let's do that and make other general upgrades as well in prep for 1.19.
> This is a blanket ticket to cover these upgrades.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TIKA-2710) Set Tika to OSGi Execution Environment JavaSE-1.8

2018-08-17 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2710:


 Summary: Set Tika to OSGi Execution Environment JavaSE-1.8
 Key: TIKA-2710
 URL: https://issues.apache.org/jira/browse/TIKA-2710
 Project: Tika
  Issue Type: Improvement
Reporter: Bob Paulin
Assignee: Bob Paulin


In TIKA-2692 the OSGi execution environment (osgi.ee) was removed since it was 
coming out as 9.0 when 1.8 was expected.  This is due to the 
org.ow2.asm:asm:6.2 JAR file having a module-info.class within it.  Turns out 
this causes the maven-bundle-plugin to increase the execution environment to 
Java 9.  Rather than removing osgi.ee we should manually set it to 1.8 which is 
correct since Java 8 will simply ignore the module-info.class.  This can be 
reverted once Tika moves up to Java 9 or beyond.
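
For illustration, here is a minimal Java sketch of why a stray module-info.class could raise the detected execution environment. It assumes the tooling derives the environment from the highest class file major version it finds (52 for Java 8, 53 for Java 9). The class and method names are hypothetical and are not part of the maven-bundle-plugin.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: maps a class file header to an execution environment label. */
class ClassFileEe {

    /** Reads the major version from the first 8 bytes of a class file. */
    static int majorVersion(byte[] classFile) {
        ByteBuffer buf = ByteBuffer.wrap(classFile);   // big-endian by default, like class files
        if (buf.getInt() != 0xCAFEBABE) {              // magic number
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort();                                // minor version, unused here
        return buf.getShort() & 0xFFFF;                // major version
    }

    /** Translates a major version into a JavaSE label (52 is Java 8, 53 is Java 9). */
    static String executionEnvironment(int major) {
        int release = major - 44;
        return major <= 52 ? "JavaSE-1." + release : "JavaSE-" + release;
    }

    /** Builds a fake 8-byte class file header, for demonstration only. */
    static byte[] header(int major) {
        return ByteBuffer.allocate(8).putInt(0xCAFEBABE).putShort((short) 0)
                .putShort((short) major).array();
    }

    public static void main(String[] args) {
        // An ordinary Java 8 class next to a module-info.class compiled for Java 9.
        int highest = Math.max(majorVersion(header(52)), majorVersion(header(53)));
        System.out.println(executionEnvironment(highest));
    }
}
```

Under that assumption, a single class file with major version 53 is enough to move the whole bundle to Java 9, which is why pinning the value by hand is the safer choice while Tika still targets Java 8.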



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2660) Prep Tika for Java 10

2018-06-04 Thread Bob Paulin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500404#comment-16500404
 ] 

Bob Paulin commented on TIKA-2660:
--

Hey Tim.  I have some thoughts on this, but I do think the parsers will be a 
challenge without splitting into modules.  Tika should still work without the 
module-info.java on Java 10.  Folks could use automatic modules [1] if they 
wanted to use modules.  I remember when I was playing with Tika on Java 9 
pre-release builds there were some issues with how we're accessing some 
resources on the module path vs. the classpath.  I should have some notes on 
this from my 2016 ApacheCon talk.


 

[1] http://paulbakker.io/java/java9-vertx/
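
As a rough illustration of the point above, the following Java sketch (Java 9 or later) reports which module a class was loaded into. Code on the classpath ends up in the unnamed module, which is why a library without a module-info.java keeps working. The class name ModuleCheck is made up for this example.

```java
/** Hypothetical helper showing where a class ends up under the Java 9+ module system. */
class ModuleCheck {

    /** Describes the module that the given class belongs to. */
    static String describe(Class<?> type) {
        Module module = type.getModule();
        return module.isNamed()
                ? "named module " + module.getName()
                : "unnamed module (classpath)";
    }

    public static void main(String[] args) {
        // JDK classes live in named modules such as java.base.
        System.out.println(describe(String.class));
        // A class started from the classpath reports the unnamed module.
        System.out.println(describe(ModuleCheck.class));
    }
}
```

A plain JAR placed on the module path instead of the classpath becomes an automatic module, so the same check would then report a named module for its classes.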

> Prep Tika for Java 10
> -
>
> Key: TIKA-2660
> URL: https://issues.apache.org/jira/browse/TIKA-2660
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> At the very least, we need to fix some split packages...some of which are my 
> fault...sorry!
> I'm guessing we can add this to the todo list for 2.0.0?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (TIKA-2506) Nullpointer in tika-dl test on windows

2017-11-17 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-2506.
--
Resolution: Fixed

> Nullpointer in tika-dl test on windows
> --
>
> Key: TIKA-2506
> URL: https://issues.apache.org/jira/browse/TIKA-2506
> Project: Tika
>  Issue Type: Bug
>  Components: tika-dl
>Affects Versions: 1.17
> Environment: Windows
>Reporter: Bob Paulin
>Assignee: Bob Paulin
> Fix For: 1.17
>
>
> During a build on windows I get the following:
> {code}
> Running org.apache.tika.dl.imagerec.DL4JVGG16NetTest
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.024 sec <<< 
> FAILURE! - in org.apache.tika.dl.imagerec.DL4JVGG16NetTest
> recognise(org.apache.tika.dl.imagerec.DL4JVGG16NetTest)  Time elapsed: 0.024 
> sec  <<< ERROR!
> java.lang.NullPointerException: null
>   at org.apache.tika.Tika.<init>(Tika.java:109)
>   at 
> org.apache.tika.dl.imagerec.DL4JVGG16NetTest.recognise(DL4JVGG16NetTest.java:42)
> HDF5-DIAG: Error detected in HDF5 (1.10.0-patch1) thread 0:
>   #000: C:\autotest\HDF5110ReleaseRWDITAR\src\H5F.c line 579 in H5Fopen(): 
> unable to open file
> major: File accessibilty
> minor: Unable to open file
>   #001: C:\autotest\HDF5110ReleaseRWDITAR\src\H5Fint.c line 1208 in 
> H5F_open(): unable to read superblock
> major: File accessibilty
> minor: Read failed
> SUREFIRE-859:   #002: C:\autotest\HDF5110ReleaseRWDITAR\src\H5Fsuper.c line 
> 443 in H5F__super_read(): truncated file: eof = 147097136, sblock->base_addr 
> = 0, stored_eof = 553466928
> major: File accessibilty
> minor: File has been truncated
> Results :
> Tests in error: 
>   DL4JVGG16NetTest.recognise:42 » NullPointer
> {code}
> It appears to be looking for some installed native code that it can't find.  
> I believe we should check for null config and if null we skip this test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (TIKA-2506) Nullpointer in tika-dl test on windows

2017-11-17 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2506:


 Summary: Nullpointer in tika-dl test on windows
 Key: TIKA-2506
 URL: https://issues.apache.org/jira/browse/TIKA-2506
 Project: Tika
  Issue Type: Bug
  Components: tika-dl
Affects Versions: 1.17
 Environment: Windows
Reporter: Bob Paulin
Assignee: Bob Paulin
 Fix For: 1.17


During a build on windows I get the following:
{code}
Running org.apache.tika.dl.imagerec.DL4JVGG16NetTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.024 sec <<< 
FAILURE! - in org.apache.tika.dl.imagerec.DL4JVGG16NetTest
recognise(org.apache.tika.dl.imagerec.DL4JVGG16NetTest)  Time elapsed: 0.024 
sec  <<< ERROR!
java.lang.NullPointerException: null
at org.apache.tika.Tika.<init>(Tika.java:109)
at 
org.apache.tika.dl.imagerec.DL4JVGG16NetTest.recognise(DL4JVGG16NetTest.java:42)

HDF5-DIAG: Error detected in HDF5 (1.10.0-patch1) thread 0:
  #000: C:\autotest\HDF5110ReleaseRWDITAR\src\H5F.c line 579 in H5Fopen(): 
unable to open file
major: File accessibilty
minor: Unable to open file
  #001: C:\autotest\HDF5110ReleaseRWDITAR\src\H5Fint.c line 1208 in H5F_open(): 
unable to read superblock
major: File accessibilty
minor: Read failed
SUREFIRE-859:   #002: C:\autotest\HDF5110ReleaseRWDITAR\src\H5Fsuper.c line 443 
in H5F__super_read(): truncated file: eof = 147097136, sblock->base_addr = 0, 
stored_eof = 553466928
major: File accessibilty
minor: File has been truncated

Results :

Tests in error: 
  DL4JVGG16NetTest.recognise:42 » NullPointer
{code}

It appears to be looking for some installed native code that it can't find.  I 
believe we should check for null config and if null we skip this test.
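
A minimal sketch of the proposed guard, written in plain Java so it stands alone. The names below are hypothetical and this is not the actual DL4JVGG16NetTest code; in a JUnit 4 test the same idea is usually expressed with the Assume class.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical guard: run a test body only when its configuration could be loaded. */
class SkipWhenUnavailable {

    enum Outcome { PASSED, SKIPPED }

    /**
     * Loads the config and skips the body when loading returned null,
     * instead of letting the body fail with a NullPointerException.
     */
    static <C> Outcome runIfConfigured(Supplier<C> configLoader, Consumer<C> testBody) {
        C config = configLoader.get();
        if (config == null) {
            return Outcome.SKIPPED;   // e.g. native code or model file not available
        }
        testBody.accept(config);
        return Outcome.PASSED;
    }

    public static void main(String[] args) {
        // Simulates a machine where the config cannot be loaded.
        Outcome outcome = SkipWhenUnavailable.<String>runIfConfigured(
                () -> null,
                config -> { throw new IllegalStateException("should not run"); });
        System.out.println(outcome);
    }
}
```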



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-2502) Upgrade OpenNLP to 1.8.3

2017-11-17 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257726#comment-16257726
 ] 

Bob Paulin commented on TIKA-2502:
--

OK, I should have a patch shortly that upgrades the maven-bundle-plugin to 
3.3.0, with no workaround needed.  However, it did require some restructuring 
of the pom.xml.  I guess it's time to get cranking on the 2.0 stuff!

> Upgrade OpenNLP to 1.8.3
> 
>
> Key: TIKA-2502
> URL: https://issues.apache.org/jira/browse/TIKA-2502
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
> Fix For: 1.17
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-2502) Upgrade OpenNLP to 1.8.3

2017-11-14 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252146#comment-16252146
 ] 

Bob Paulin commented on TIKA-2502:
--

It looks like the error comes from the Felix maven-bundle-plugin, not from the 
Felix runtime.  I will need to take a look in a bit; changes in transitive 
dependencies that the plugin is not finding seem the more likely cause here.

> Upgrade OpenNLP to 1.8.3
> 
>
> Key: TIKA-2502
> URL: https://issues.apache.org/jira/browse/TIKA-2502
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
> Fix For: 1.17
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-2411) Clean up tika-bundle

2017-07-03 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16072715#comment-16072715
 ] 

Bob Paulin commented on TIKA-2411:
--

opennlp-maxent and jwnl used to be transitive dependencies pulled in from 
opennlp-tools.  It looks like they are no longer used since we upgraded.  Both 
can be safely removed.


> Clean up tika-bundle
> 
>
> Key: TIKA-2411
> URL: https://issues.apache.org/jira/browse/TIKA-2411
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
>
> We're getting:
> {noformat}
> [WARNING] Embed-Dependency: clause "opennlp-maxent" did not match any 
> dependencies
> [WARNING] Embed-Dependency: clause "jwnl" did not match any dependencies
> {noformat}
> when building tika-bundle.  Should we delete these or do we need to fix them?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (TIKA-2379) tika-bundle 1.15 has wrong import of org.sfl4j.event package which does not exists

2017-05-31 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-2379.
--
Resolution: Fixed

> tika-bundle 1.15 has wrong import of org.sfl4j.event package which does not 
> exists
> --
>
> Key: TIKA-2379
> URL: https://issues.apache.org/jira/browse/TIKA-2379
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.15
>Reporter: Claus Ibsen
>Assignee: Bob Paulin
>Priority: Blocker
>
> The new release 1.15 now fails in Apache Camel when we run our OSGi tests
> {code}
> test(org.apache.camel.itest.karaf.CamelTikaTest)  Time elapsed: 7.212 sec  
> <<< ERROR!
> org.ops4j.pax.exam.WrappedTestContainerException: 
> [test(org.apache.camel.itest.karaf.CamelTikaTest): Unable to resolve root: 
> missing requirement [root] osgi.identity; osgi.identity=camel-tika; 
> type=karaf.feature; version="[2.20.0.SNAPSHOT,2.20.0.SNAPSHOT]"; 
> filter:="(&(osgi.identity=camel-tika)(type=karaf.feature)(version>=2.20.0.SNAPSHOT)(version<=2.20.0.SNAPSHOT))"
>  [caused by: Unable to resolve camel-tika/2.20.0.SNAPSHOT: missing 
> requirement [camel-tika/2.20.0.SNAPSHOT] osgi.identity; 
> osgi.identity=org.apache.camel.camel-tika; type=osgi.bundle; 
> version="[2.20.0.SNAPSHOT,2.20.0.SNAPSHOT]"; resolution:=mandatory [caused 
> by: Unable to resolve org.apache.camel.camel-tika/2.20.0.SNAPSHOT: missing 
> requirement [org.apache.camel.camel-tika/2.20.0.SNAPSHOT] 
> osgi.wiring.package; 
> filter:="(osgi.wiring.package=org.apache.tika.parser.html)" [caused by: 
> Unable to resolve org.apache.tika.bundle/1.15.0: missing requirement 
> [org.apache.tika.bundle/1.15.0] osgi.wiring.package; 
> filter:="(&(osgi.wiring.package=org.slf4j.event)(version>=1.7.0)(!(version>=2.0.0)))"
>   at 
> org.apache.felix.resolver.ResolutionError.toException(ResolutionError.java:42)
>   at 
> org.apache.felix.resolver.ResolverImpl.doResolve(ResolverImpl.java:389)
>   at org.apache.felix.resolver.ResolverImpl.resolve(ResolverImpl.java:375)
>   at org.apache.felix.resolver.ResolverImpl.resolve(ResolverImpl.java:347)
>   at 
> org.apache.karaf.features.internal.region.SubsystemResolver.resolve(SubsystemResolver.java:218)
>   at 
> org.apache.karaf.features.internal.service.Deployer.deploy(Deployer.java:285)
>   at 
> org.apache.karaf.features.internal.service.FeaturesServiceImpl.doProvision(FeaturesServiceImpl.java:1170)
>   at 
> org.apache.karaf.features.internal.service.FeaturesServiceImpl.lambda$doProvisionInThread$0(FeaturesServiceImpl.java:1069)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> The problem is that tika-bundle has an import on package
> {code}
> org.slf4j.event;version="[1.7,2)"
> {code}
> And that package does not exist in 1.7.x. It looks like it's a new thing that 
> comes in slf4j 1.8 onwards, but they have only alpha releases:
> http://search.maven.org/#search%7Cga%7C1%7Cfc%3A%22org.slf4j.event%22
> It would be good to get this fixed. I wonder if that event package is really 
> needed? If not, then please remove that import in the OSGi manifest file.
> Otherwise you would need to depend on slf4j-api version 1.8, which is still 
> not released as GA and not widely in use. I would suggest staying compatible 
> with slf4j 1.7 so the Tika upgrade from e.g. 1.14 to 1.15 is a smooth upgrade 
> for end users.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (TIKA-2379) tika-bundle 1.15 has wrong import of org.sfl4j.event package which does not exists

2017-05-31 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032423#comment-16032423
 ] 

Bob Paulin commented on TIKA-2379:
--

It looks like a lot changed in this bundle between 1.14 and 1.15.  I couldn't 
find any JIRAs or detailed commit comments to back up these modifications, so I 
reverted many of them.  Here are the highlights:
* Changed SLF4J's resolution to be optional again.  This should fix 
[~davsclaus]'s issue.
* Removed the Dynamic-Import statement.  I don't believe this is appropriate.
* Added Pax Logging bundles to the test, as some of the Parsers do require 
org.apache.commons.logging to run.
* Upgraded the OSGi spec, Felix runtime, and Pax Exam for the test runner.

Please review.
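
To show why the optional resolution matters, here is a simplified Java model of how an import such as org.slf4j.event;version="[1.7,2)" is matched against exported packages. It is only a sketch under stated assumptions: the class is hypothetical, it is not the Felix resolver, and it assumes an SLF4J 1.7.x bundle that exports org.slf4j but not org.slf4j.event, as described in the report.

```java
import java.util.Map;

/** Hypothetical, simplified model of matching an Import-Package clause. */
class ImportResolutionSketch {

    /** Compares dotted numeric versions; missing parts count as 0. */
    static int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** True when version lies in the half-open range [floor, ceiling). */
    static boolean inRange(String version, String floor, String ceiling) {
        return compare(version, floor) >= 0 && compare(version, ceiling) < 0;
    }

    /** An import is satisfied if an export matches the range, or if it is optional. */
    static boolean resolves(Map<String, String> exports, String pkg,
                            String floor, String ceiling, boolean optional) {
        String exported = exports.get(pkg);
        boolean wired = exported != null && inRange(exported, floor, ceiling);
        return wired || optional;
    }

    public static void main(String[] args) {
        // Assumed exports of an SLF4J 1.7.x bundle (package name -> version).
        Map<String, String> exports = Map.of("org.slf4j", "1.7.25");
        System.out.println(resolves(exports, "org.slf4j.event", "1.7", "2", false));
        System.out.println(resolves(exports, "org.slf4j.event", "1.7", "2", true));
    }
}
```

In this model a mandatory import of a package nobody exports blocks the whole bundle, while an optional import simply stays unwired.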

> tika-bundle 1.15 has wrong import of org.sfl4j.event package which does not 
> exists
> --
>
> Key: TIKA-2379
> URL: https://issues.apache.org/jira/browse/TIKA-2379
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.15
>Reporter: Claus Ibsen
>Assignee: Bob Paulin
>Priority: Blocker
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (TIKA-2379) tika-bundle 1.15 has wrong import of org.sfl4j.event package which does not exists

2017-05-31 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031223#comment-16031223
 ] 

Bob Paulin commented on TIKA-2379:
--

Will take a look.  I'm guessing there was a change in the dependency tree that 
caused this.  We can probably add a PaxExam test with a SLF4J 1.7 bundle to 
ensure this works in future releases.  Thanks for the heads up 
[~talli...@mitre.org]

> tika-bundle 1.15 has wrong import of org.sfl4j.event package which does not 
> exists
> --
>
> Key: TIKA-2379
> URL: https://issues.apache.org/jira/browse/TIKA-2379
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.15
>Reporter: Claus Ibsen
>Priority: Blocker
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (TIKA-2379) tika-bundle 1.15 has wrong import of org.sfl4j.event package which does not exists

2017-05-31 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin reassigned TIKA-2379:


Assignee: Bob Paulin

> tika-bundle 1.15 has wrong import of org.sfl4j.event package which does not 
> exists
> --
>
> Key: TIKA-2379
> URL: https://issues.apache.org/jira/browse/TIKA-2379
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.15
>Reporter: Claus Ibsen
>Assignee: Bob Paulin
>Priority: Blocker
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (TIKA-2245) Standardise logging

2017-01-19 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830464#comment-15830464
 ] 

Bob Paulin commented on TIKA-2245:
--

[~grossws] In my experience OSGi is pretty unopinionated on logging.  For 
example, Pax Logging is what I often use: 
https://ops4j1.jira.com/wiki/display/ops4j/Using+pax+logging

> Standardise logging
> ---
>
> Key: TIKA-2245
> URL: https://issues.apache.org/jira/browse/TIKA-2245
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14, 1.15
>Reporter: Matthew Caruana Galizia
>Assignee: Konstantin Gribov
>  Labels: logging
>
> Tika parsers sometimes use Log4j's Logger, sometimes the JUL 
> (java.util.logging) Logger and sometimes SLF4j.
> It would be better to standardise on a single facade, for the sake of not 
> having to configure multiple loggers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-2245) Standardise logging

2017-01-19 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830464#comment-15830464
 ] 

Bob Paulin edited comment on TIKA-2245 at 1/19/17 7:30 PM:
---

[~grossws] In my experience OSGi is pretty unopinionated on logging.  For 
example, Pax Logging is what I often use: 
https://ops4j1.jira.com/wiki/display/ops4j/Using+pax+logging .  SLF4J is 
supported.


was (Author: bobpaulin):
[~grossws] In my experience OSGi is pretty unopinionated on logging.  For 
example, Pax Logging is what I often use: 
https://ops4j1.jira.com/wiki/display/ops4j/Using+pax+logging

> Standardise logging
> ---
>
> Key: TIKA-2245
> URL: https://issues.apache.org/jira/browse/TIKA-2245
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14, 1.15
>Reporter: Matthew Caruana Galizia
>Assignee: Konstantin Gribov
>  Labels: logging
>
> Tika parsers sometimes use Log4j's Logger, sometimes the JUL 
> (java.util.logging) Logger and sometimes SLF4j.
> It would be better to standardise on a single facade, for the sake of not 
> having to configure multiple loggers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-2083) Tika 2.0 - Audit master branch against 2.x branch

2016-09-19 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2083:


 Summary: Tika 2.0 - Audit master branch against 2.x branch
 Key: TIKA-2083
 URL: https://issues.apache.org/jira/browse/TIKA-2083
 Project: Tika
  Issue Type: Task
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin
Priority: Blocker
 Fix For: 2.0


At this point Tika has been doing parallel development on master and the 2.x 
branch for about 9 months.  We should audit the commit logs for that time and 
make a best effort to identify any commits that may not have been applied in 
2.x.  This task should be done prior to the 2.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-2076) Tika 2.0 - Tika App using bundles

2016-09-11 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2076:


 Summary: Tika 2.0 - Tika App using bundles
 Key: TIKA-2076
 URL: https://issues.apache.org/jira/browse/TIKA-2076
 Project: Tika
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin


Create a tika app using bundles.  See here for more details - 
https://github.com/bobpaulin/tika-app-osgi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-2070) Tika 2.0 - Add Encoding Detector and Language Detectors to Dynamic Service Loader

2016-09-09 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-2070.
--
Resolution: Fixed

> Tika 2.0 - Add Encoding Detector and Language Detectors to Dynamic Service 
> Loader
> -
>
> Key: TIKA-2070
> URL: https://issues.apache.org/jira/browse/TIKA-2070
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Currently only Parser and Detector classes are added to the dynamic 
> ServiceLoader list.  We should extend this to include the EncodingDetector 
> and LanguageDetector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-2075) Tika 2.0 - Expose Additional TikaService methods

2016-09-09 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin updated TIKA-2075:
-
Summary: Tika 2.0  - Expose Additional TikaService methods  (was: Tika 2.0  
- Expose Additonal TikaService methods)

> Tika 2.0  - Expose Additional TikaService methods
> -
>
> Key: TIKA-2075
> URL: https://issues.apache.org/jira/browse/TIKA-2075
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> TikaService should also expose direct access to wrapped member variables such 
> as ServiceLoader, Parser, and Detector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-2072) Tika 2.0 - Create TikaServiceFactory for creating TikaService

2016-09-09 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-2072.
--
Resolution: Fixed

> Tika 2.0 - Create TikaServiceFactory for creating TikaService
> -
>
> Key: TIKA-2072
> URL: https://issues.apache.org/jira/browse/TIKA-2072
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> In order to create TikaService objects with different configs we should have 
> a factory available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-2075) Tika 2.0 - Expose Additonal TikaService methods

2016-09-09 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2075:


 Summary: Tika 2.0  - Expose Additonal TikaService methods
 Key: TIKA-2075
 URL: https://issues.apache.org/jira/browse/TIKA-2075
 Project: Tika
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin


TikaService should also expose direct access to wrapped member variables such 
as ServiceLoader, Parser, and Detector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-2074) Tika 2.0 - Allow ServiceLoader to use Class files loaded via dynamic loading

2016-09-09 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2074:


 Summary: Tika 2.0 - Allow ServiceLoader to use Class files loaded 
via dynamic loading
 Key: TIKA-2074
 URL: https://issues.apache.org/jira/browse/TIKA-2074
 Project: Tika
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin


Currently the ServiceLoader depends on the classes loaded by the static 
ServiceLoader to instantiate classes within the TikaConfig.  This should be 
extended to include dynamic classes as well to function in a dynamic OSGi 
environment.
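
One way to picture the requested behaviour is a registry that merges a fixed, statically discovered list with services that come and go at runtime. The Java sketch below is only an assumption about the general shape. CombinedServiceRegistry is a made-up name and is not Tika's ServiceLoader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical registry combining statically loaded and dynamically registered services. */
class CombinedServiceRegistry<T> {

    // Services discovered once at startup (for example from the classpath).
    private final List<T> staticServices;
    // Services added or removed later (for example as OSGi bundles start and stop).
    private final List<T> dynamicServices = new CopyOnWriteArrayList<>();

    CombinedServiceRegistry(List<T> staticServices) {
        this.staticServices = new ArrayList<>(staticServices);
    }

    void register(T service) {
        dynamicServices.add(service);
    }

    void unregister(T service) {
        dynamicServices.remove(service);
    }

    /** Returns a snapshot: static services first, then the current dynamic ones. */
    List<T> all() {
        List<T> all = new ArrayList<>(staticServices);
        all.addAll(dynamicServices);
        return all;
    }

    public static void main(String[] args) {
        CombinedServiceRegistry<String> registry =
                new CombinedServiceRegistry<>(List.of("static-parser"));
        registry.register("dynamic-parser");
        System.out.println(registry.all());
    }
}
```

A config loader that looks up classes through all() rather than through the static list alone would also see services contributed after startup.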



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-2073) Tika 2.0 - Tika Language Detect Project should include Bundle Activator and packaging consistent with other modules

2016-09-09 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2073:


 Summary: Tika 2.0 - Tika Language Detect Project should include 
Bundle Activator and packaging consistent with other modules
 Key: TIKA-2073
 URL: https://issues.apache.org/jira/browse/TIKA-2073
 Project: Tika
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin


Currently tika-langdetect does not register the LanguageDetectors as bundles, 
and it includes dependencies that are not OSGi friendly, so they should be embedded.





[jira] [Created] (TIKA-2072) Tika 2.0 - Create TikaServiceFactory for creating TikaService

2016-09-09 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2072:


 Summary: Tika 2.0 - Create TikaServiceFactory for creating 
TikaService
 Key: TIKA-2072
 URL: https://issues.apache.org/jira/browse/TIKA-2072
 Project: Tika
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin


In order to create TikaService objects with different configs we should have a 
factory available.





[jira] [Updated] (TIKA-2071) Tika 2.0 - DefaultParser and CompositeParser do not filter excludedParsers from dynamic ServiceLoader Parsers

2016-09-09 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin updated TIKA-2071:
-
Issue Type: Bug  (was: Improvement)

> Tika 2.0 - DefaultParser and CompositeParser do not filter excludedParsers 
> from dynamic ServiceLoader Parsers
> ---
>
> Key: TIKA-2071
> URL: https://issues.apache.org/jira/browse/TIKA-2071
> Project: Tika
>  Issue Type: Bug
>Reporter: Bob Paulin
>Assignee: Bob Paulin
> Fix For: 2.0
>
>
> The DefaultParser and CompositeParser do not filter dynamic services using 
> the excludedParser List.  The exclude list should be applied here as well.





[jira] [Updated] (TIKA-2070) Tika 2.0 - Add Encoding Detector and Language Detectors to Dynamic Service Loader

2016-09-09 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin updated TIKA-2070:
-
Affects Version/s: 2.0

> Tika 2.0 - Add Encoding Detector and Language Detectors to Dynamic Service 
> Loader
> -
>
> Key: TIKA-2070
> URL: https://issues.apache.org/jira/browse/TIKA-2070
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Currently only Parser and Detector classes are added to the dynamic 
> ServiceLoader list.  We should extend this to include the EncodingDetector 
> and LanguageDetector.





[jira] [Created] (TIKA-2071) Tika 2.0 - DefaultParser and CompositeParser do not filter excludedParsers from dynamic ServiceLoader Parsers

2016-09-09 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2071:


 Summary: Tika 2.0 - DefaultParser and CompositeParser do not 
filter excludedParsers from dynamic ServiceLoader Parsers
 Key: TIKA-2071
 URL: https://issues.apache.org/jira/browse/TIKA-2071
 Project: Tika
  Issue Type: Improvement
Reporter: Bob Paulin
Assignee: Bob Paulin
 Fix For: 2.0


The DefaultParser and CompositeParser do not filter dynamic services using the 
excludedParser List.  The exclude list should be applied here as well.
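
A minimal sketch of what applying the exclude list to dynamically loaded 
services could look like. SketchParser and the two implementations are 
hypothetical stand-ins, not the real Tika Parser types, and this is not the 
actual DefaultParser/CompositeParser code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in for org.apache.tika.parser.Parser (assumption, not the real type).
interface SketchParser {}

final class HtmlSketchParser implements SketchParser {}

final class PdfSketchParser implements SketchParser {}

final class ExcludeFilterSketch {

    // Returns the loaded parsers minus any whose class is on the exclude list
    // (or is a subclass of an excluded class).
    static List<SketchParser> applyExcludes(
            List<SketchParser> loaded, Collection<? extends Class<?>> excluded) {
        List<SketchParser> kept = new ArrayList<>();
        for (SketchParser parser : loaded) {
            boolean isExcluded = false;
            for (Class<?> x : excluded) {
                if (x.isAssignableFrom(parser.getClass())) {
                    isExcluded = true;
                    break;
                }
            }
            if (!isExcluded) {
                kept.add(parser);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Pretend these came from the dynamic (e.g. OSGi) service list.
        List<SketchParser> dynamic =
                Arrays.asList(new HtmlSketchParser(), new PdfSketchParser());
        List<SketchParser> kept =
                applyExcludes(dynamic, Arrays.asList(PdfSketchParser.class));
        System.out.println(kept.size() + " parser(s) kept");
    }
}
```

The same filter would be run over both the static and the dynamic lists so 
the two sources behave the same way.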





[jira] [Created] (TIKA-2070) Tika 2.0 - Add Encoding Detector and Language Detectors to Dynamic Service Loader

2016-09-09 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2070:


 Summary: Tika 2.0 - Add Encoding Detector and Language Detectors 
to Dynamic Service Loader
 Key: TIKA-2070
 URL: https://issues.apache.org/jira/browse/TIKA-2070
 Project: Tika
  Issue Type: Improvement
Reporter: Bob Paulin
Assignee: Bob Paulin


Currently only Parser and Detector classes are added to the dynamic 
ServiceLoader list.  We should extend this to include the EncodingDetector and 
LanguageDetector.





[jira] [Reopened] (TIKA-2061) Tika 2.0 - Embed xmpcore dependency in tika-xmp bundle

2016-08-29 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin reopened TIKA-2061:
--

Would someone be able to review this to ensure I added the xmpcore BSD license 
correctly?  Thanks!

> Tika 2.0 - Embed xmpcore dependency in tika-xmp bundle
> --
>
> Key: TIKA-2061
> URL: https://issues.apache.org/jira/browse/TIKA-2061
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> The xmpcore is not a functional OSGi bundle. Suggest embedding it inside the 
> tika-xmp jar so it can function in an OSGi environment.





[jira] [Resolved] (TIKA-2063) Tika 2.0 - Create Vorbis Parser bundle

2016-08-29 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-2063.
--
Resolution: Fixed
  Assignee: Bob Paulin

> Tika 2.0 - Create Vorbis Parser bundle
> --
>
> Key: TIKA-2063
> URL: https://issues.apache.org/jira/browse/TIKA-2063
> Project: Tika
>  Issue Type: Task
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Vorbis Parsers are hosted outside of Tika but are included in the current 
> tika app.  Need to create a bundle version of this.





[jira] [Resolved] (TIKA-2060) Tika 2.0 - Add Toggle to Tika Batch ClassLoaderUtil to enable OSGi loading

2016-08-29 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-2060.
--
Resolution: Fixed

> Tika 2.0 - Add Toggle to Tika Batch ClassLoaderUtil to enable OSGi loading
> --
>
> Key: TIKA-2060
> URL: https://issues.apache.org/jira/browse/TIKA-2060
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Currently the ClassLoaderUtil class in Tika Batch only uses the system 
> classloader, which doesn't work properly when used in OSGi.  Would like to add 
> an optional attribute to tika-batch-config called useBundleClassLoader to 
> allow the system classloader to be replaced with the classloader used for the 
> ClassLoaderUtil class (which would be the same as the bundle that loaded it).





[jira] [Created] (TIKA-2063) Tika 2.0 - Create Vorbis Parser bundle

2016-08-28 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2063:


 Summary: Tika 2.0 - Create Vorbis Parser bundle
 Key: TIKA-2063
 URL: https://issues.apache.org/jira/browse/TIKA-2063
 Project: Tika
  Issue Type: Task
Reporter: Bob Paulin


Vorbis Parsers are hosted outside of Tika but are included in the current tika 
app.  Need to create a bundle version of this.





[jira] [Resolved] (TIKA-2061) Tika 2.0 - Embed xmpcore dependency in tika-xmp bundle

2016-08-28 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-2061.
--
Resolution: Fixed

> Tika 2.0 - Embed xmpcore dependency in tika-xmp bundle
> --
>
> Key: TIKA-2061
> URL: https://issues.apache.org/jira/browse/TIKA-2061
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> The xmpcore is not a functional OSGi bundle. Suggest embedding it inside the 
> tika-xmp jar so it can function in an OSGi environment.





[jira] [Resolved] (TIKA-2062) Tika 2.0 - Remove Inlining of Bouncy Castle jars in tika-bundle projects

2016-08-28 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-2062.
--
Resolution: Fixed

> Tika 2.0 - Remove Inlining of Bouncy Castle jars in tika-bundle projects
> 
>
> Key: TIKA-2062
> URL: https://issues.apache.org/jira/browse/TIKA-2062
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> When bouncy castle is inlined it embeds several files in the META-INF folder 
> that can cause the de.thetaphi:forbiddenapis maven plugin to fail.  When they 
> are not inlined the files remain in the bouncy castle jar and do not cause 
> issues.





[jira] [Created] (TIKA-2062) Tika 2.0 - Remove Inlining of Bouncy Castle jars in tika-bundle projects

2016-08-27 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2062:


 Summary: Tika 2.0 - Remove Inlining of Bouncy Castle jars in 
tika-bundle projects
 Key: TIKA-2062
 URL: https://issues.apache.org/jira/browse/TIKA-2062
 Project: Tika
  Issue Type: Task
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin


When bouncy castle is inlined it embeds several files in the META-INF folder 
that can cause the de.thetaphi:forbiddenapis maven plugin to fail.  When they 
are not inlined the files remain in the bouncy castle jar and do not cause 
issues.





[jira] [Created] (TIKA-2061) Tika 2.0 - Embed xmpcore dependency in tika-xmp bundle

2016-08-27 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2061:


 Summary: Tika 2.0 - Embed xmpcore dependency in tika-xmp bundle
 Key: TIKA-2061
 URL: https://issues.apache.org/jira/browse/TIKA-2061
 Project: Tika
  Issue Type: Task
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin


The xmpcore is not a functional OSGi bundle. Suggest embedding it inside the 
tika-xmp jar so it can function in an OSGi environment.





[jira] [Created] (TIKA-2060) Tika 2.0 - Add Toggle to Tika Batch ClassLoaderUtil to enable OSGi loading

2016-08-27 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2060:


 Summary: Tika 2.0 - Add Toggle to Tika Batch ClassLoaderUtil to 
enable OSGi loading
 Key: TIKA-2060
 URL: https://issues.apache.org/jira/browse/TIKA-2060
 Project: Tika
  Issue Type: Task
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin


Currently the ClassLoaderUtil class in Tika Batch only uses the system 
classloader, which doesn't work properly when used in OSGi.  Would like to add 
an optional attribute to tika-batch-config called useBundleClassLoader to allow 
the system classloader to be replaced with the classloader used for the 
ClassLoaderUtil class (which would be the same as the bundle that loaded it).
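
Roughly what the toggle could look like, as a sketch only. The class and 
method names here are hypothetical, and how useBundleClassLoader gets read from 
tika-batch-config is not shown.

```java
final class ClassLoaderToggleSketch {

    // Picks the classloader used to resolve classes. The boolean mirrors the
    // proposed useBundleClassLoader attribute (assumed to be parsed elsewhere).
    static ClassLoader chooseLoader(boolean useBundleClassLoader) {
        if (useBundleClassLoader) {
            // In OSGi this is the loader of the bundle that defined this class.
            return ClassLoaderToggleSketch.class.getClassLoader();
        }
        return ClassLoader.getSystemClassLoader();
    }

    // Loads a class by name through whichever loader the flag selects.
    static Class<?> load(String className, boolean useBundleClassLoader)
            throws ClassNotFoundException {
        return Class.forName(className, true, chooseLoader(useBundleClassLoader));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(load("java.lang.String", true).getName());
        System.out.println(load("java.lang.String", false).getName());
    }
}
```

Outside of OSGi both branches usually resolve the same classes, so the default 
(false) keeps today's behavior.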





[jira] [Created] (TIKA-2059) Tika 2.0 - Merge PDF and Multimedia Modules

2016-08-27 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2059:


 Summary: Tika 2.0 - Merge PDF and Multimedia Modules
 Key: TIKA-2059
 URL: https://issues.apache.org/jira/browse/TIKA-2059
 Project: Tika
  Issue Type: Task
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin


Need to merge Tika PDF and Multimedia modules due to coupling with TesseractOCR





[jira] [Commented] (TIKA-1844) PooledTimeSeriesParser takes precedence over MP4Parser

2016-04-27 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260071#comment-15260071
 ] 

Bob Paulin commented on TIKA-1844:
--

+1.  Good time to be running into this stuff since we're still pre-release for 
2.x

> PooledTimeSeriesParser takes precedence over MP4Parser
> --
>
> Key: TIKA-1844
> URL: https://issues.apache.org/jira/browse/TIKA-1844
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.13
>
>
> The PooledTimeSeriesParser currently takes precedence over the MP4Parser even 
> if the pooled-time-series application is not installed.  This means that 
> clients will lose metadata formerly extracted by the MP4Parser unless they 
> remove the PooledTimeSeriesParser.
> This is similar to what happened with the integration of the Tesseract Parser 
> (TIKA-1445).  We should probably follow a similar pattern to that...run both 
> parsers and combine metadata.





[jira] [Commented] (TIKA-1844) PooledTimeSeriesParser takes precedence over MP4Parser

2016-04-21 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253255#comment-15253255
 ] 

Bob Paulin commented on TIKA-1844:
--

[~talli...@mitre.org] So the simplest solution would be to just move the 
PooledTimeSeriesParser to the multimedia bundle.  Its supported media types are 
all video so it could fit.  Otherwise I could set up the MP4Parser using a 
ParserProxy so multimedia is just optional rather than required.

> PooledTimeSeriesParser takes precedence over MP4Parser
> --
>
> Key: TIKA-1844
> URL: https://issues.apache.org/jira/browse/TIKA-1844
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.13
>
>
> The PooledTimeSeriesParser currently takes precedence over the MP4Parser even 
> if the pooled-time-series application is not installed.  This means that 
> clients will lose metadata formerly extracted by the MP4Parser unless they 
> remove the PooledTimeSeriesParser.
> This is similar to what happened with the integration of the Tesseract Parser 
> (TIKA-1445).  We should probably follow a similar pattern to that...run both 
> parsers and combine metadata.





[jira] [Commented] (TIKA-1844) PooledTimeSeriesParser takes precedence over MP4Parser

2016-04-21 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252918#comment-15252918
 ] 

Bob Paulin commented on TIKA-1844:
--

I should be able to take a look tomorrow.  After removing POI from all but 
Office I think anything is possible :).

> PooledTimeSeriesParser takes precedence over MP4Parser
> --
>
> Key: TIKA-1844
> URL: https://issues.apache.org/jira/browse/TIKA-1844
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.13
>
>
> The PooledTimeSeriesParser currently takes precedence over the MP4Parser even 
> if the pooled-time-series application is not installed.  This means that 
> clients will lose metadata formerly extracted by the MP4Parser unless they 
> remove the PooledTimeSeriesParser.
> This is similar to what happened with the integration of the Tesseract Parser 
> (TIKA-1445).  We should probably follow a similar pattern to that...run both 
> parsers and combine metadata.





[jira] [Commented] (TIKA-1910) Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies

2016-04-04 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15225474#comment-15225474
 ] 

Bob Paulin commented on TIKA-1910:
--

Thinking of a way to pass the ServiceLoader...

Allow Detectors, Parsers, and EncodingDetectors to leverage the ServiceLoader 
in their constructors:
Move object instantiation into the ServiceLoader class (this replaces all 
instances of klass.newInstance() in ServiceLoader and TikaConfig):
{code}
public <T> T createServiceInstance(Class<? extends T> klass)
        throws InstantiationException, IllegalAccessException,
        InvocationTargetException {
    T serviceInstance = null;
    try {
        // Prefer a constructor that accepts the ServiceLoader.
        serviceInstance =
                klass.getConstructor(ServiceLoader.class).newInstance(this);
    } catch (NoSuchMethodException e) {
        // No such constructor; fall back to the no-arg constructor below.
    }
    if (serviceInstance == null) {
        serviceInstance = klass.newInstance();
    }
    return serviceInstance;
}
{code}

Then newInstance becomes
{code}
loaded = loader.createServiceInstance(loadedClass);
{code}

And service classes can have ServiceLoader Constructors:

{code}
public ChmParser(ServiceLoader serviceLoader) {
    this.htmlProxy = serviceLoader.getProxyService(
            "org.apache.tika.parser.html.HtmlParser",
            Parser.class, ChmParser.class.getClassLoader());
}
{code}

Thoughts on this approach?   Or is the no arg constructor sacred?
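
For anyone who wants to try the constructor fallback outside of Tika, here is a 
minimal, self-contained sketch.  SketchLoader, WithLoader and NoArg are 
hypothetical stand-ins for ServiceLoader and the service classes, and 
reflective failures are wrapped in IllegalStateException only to keep the demo 
short.

```java
// Hypothetical stand-in for Tika's ServiceLoader (assumption, not the real class).
final class SketchLoader {

    // Prefer a constructor that takes the loader; otherwise use the no-arg constructor.
    <T> T createServiceInstance(Class<? extends T> klass) {
        try {
            try {
                return klass.getDeclaredConstructor(SketchLoader.class).newInstance(this);
            } catch (NoSuchMethodException e) {
                // No loader-aware constructor declared; fall back to no-arg.
                return klass.getDeclaredConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + klass.getName(), e);
        }
    }

    public static void main(String[] args) {
        SketchLoader loader = new SketchLoader();
        WithLoader a = loader.createServiceInstance(WithLoader.class);
        NoArg b = loader.createServiceInstance(NoArg.class);
        System.out.println("loader passed: " + (a.loader == loader));
        System.out.println("no-arg created: " + (b != null));
    }
}

// Service class that wants access to the loader.
class WithLoader {
    final SketchLoader loader;

    WithLoader(SketchLoader loader) {
        this.loader = loader;
    }
}

// Service class that only has the traditional no-arg constructor.
class NoArg {
    NoArg() {}
}
```

Existing services with only a no-arg constructor keep working unchanged, which 
is the main appeal of doing it this way.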



> Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies
> -
>
> Key: TIKA-1910
> URL: https://issues.apache.org/jira/browse/TIKA-1910
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Currently the Tika Parser Office Module depends on 
> Tika Parser Web Module
> Tika Parser Package Module
> Tika Parser Text Module
> Using the proxies we can make those dependencies optional so if they are not 
> included on the classpath the code functions but performs no operation on 
> content that would be parsed on the optional dependencies.





[jira] [Commented] (TIKA-1910) Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies

2016-03-31 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219848#comment-15219848
 ] 

Bob Paulin commented on TIKA-1910:
--

Boolean system properties in Java read as false when unset (that is what 
Boolean.getBoolean returns), so we should consider changing the property names 
to org.apache.tika.service.proxy.error.ignore and 
org.apache.tika.service.error.ignore.
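
The point about unset boolean properties can be checked with plain JDK calls. 
A small sketch, where the returned strings are just stand-ins for 
LoadErrorHandler.WARN and LoadErrorHandler.IGNORE and the class name is made up:

```java
final class BooleanPropertySketch {

    // With an "...ignore" property, an unset value reads as false, so WARN
    // becomes the default and IGNORE has to be requested explicitly.
    static String handlerName(String ignoreProperty) {
        return Boolean.getBoolean(ignoreProperty) ? "IGNORE" : "WARN";
    }

    public static void main(String[] args) {
        String key = "org.apache.tika.service.proxy.error.ignore";

        System.clearProperty(key);
        System.out.println("unset: " + handlerName(key));

        System.setProperty(key, "true");
        System.out.println("true:  " + handlerName(key));

        System.clearProperty(key);
    }
}
```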

> Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies
> -
>
> Key: TIKA-1910
> URL: https://issues.apache.org/jira/browse/TIKA-1910
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Currently the Tika Parser Office Module depends on 
> Tika Parser Web Module
> Tika Parser Package Module
> Tika Parser Text Module
> Using the proxies we can make those dependencies optional so if they are not 
> included on the classpath the code functions but performs no operation on 
> content that would be parsed on the optional dependencies.





[jira] [Commented] (TIKA-1910) Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies

2016-03-31 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219815#comment-15219815
 ] 

Bob Paulin commented on TIKA-1910:
--

So just realized that the above code actually defaults to IGNORE which is what 
the ServiceLoader does with the org.apache.tika.service.error.warn property.  
Since the ServiceLoader defaults to IGNORE should we do the same or flip it?  
If we flip it then that makes the choice for a new property much easier :).

> Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies
> -
>
> Key: TIKA-1910
> URL: https://issues.apache.org/jira/browse/TIKA-1910
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Currently the Tika Parser Office Module depends on 
> Tika Parser Web Module
> Tika Parser Package Module
> Tika Parser Text Module
> Using the proxies we can make those dependencies optional so if they are not 
> included on the classpath the code functions but performs no operation on 
> content that would be parsed on the optional dependencies.





[jira] [Resolved] (TIKA-1915) Tika 2.0 - Remove POI from all but Tika Office Module

2016-03-30 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-1915.
--
Resolution: Fixed

> Tika 2.0 - Remove POI from all but Tika Office Module
> -
>
> Key: TIKA-1915
> URL: https://issues.apache.org/jira/browse/TIKA-1915
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Replace POI library with tika and commons library dependencies.  This will 
> reduce the size of transitive dependencies needed for parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2016-03-30 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219299#comment-15219299
 ] 

Bob Paulin commented on TIKA-1706:
--

Removed the duplicated tika.io classes and replaced them with commons-io 2.4 
in the Tika 2.0 branch.

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.13
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the 
> dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns:
> - Most of the non-core modules already use commons-io, and since tika-core is 
> usually not used by itself, commons-io is already included with it
> - Since some modules use both tika-core and commons-io, it's not clear which 
> code should be used
> - Having the inlined classes causes more maintenance and/or technology debt 
> (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset 
> objects instead of encoding names, being able to use StringBuilder instead of 
> StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes 
> with commons-io classes if this is accepted.





[jira] [Created] (TIKA-1915) Tika 2.0 - Remove POI from all but Tika Office Module

2016-03-30 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-1915:


 Summary: Tika 2.0 - Remove POI from all but Tika Office Module
 Key: TIKA-1915
 URL: https://issues.apache.org/jira/browse/TIKA-1915
 Project: Tika
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin


Replace POI library with tika and commons library dependencies.  This will 
reduce the size of transitive dependencies needed for parsers.





[jira] [Commented] (TIKA-1910) Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies

2016-03-30 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219202#comment-15219202
 ] 

Bob Paulin commented on TIKA-1910:
--

For now defaulting to WARN but with a new property:

{code}
Boolean.getBoolean("org.apache.tika.service.proxy.error.warn")
        ? LoadErrorHandler.WARN : LoadErrorHandler.IGNORE
{code}

On the fence whether this should just use the existing 
org.apache.tika.service.error.warn property or a new one.  Thoughts?

> Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies
> -
>
> Key: TIKA-1910
> URL: https://issues.apache.org/jira/browse/TIKA-1910
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Currently the Tika Parser Office Module depends on 
> Tika Parser Web Module
> Tika Parser Package Module
> Tika Parser Text Module
> Using the proxies we can make those dependencies optional so if they are not 
> included on the classpath the code functions but performs no operation on 
> content that would be parsed on the optional dependencies.





[jira] [Commented] (TIKA-1910) Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies

2016-03-29 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216019#comment-15216019
 ] 

Bob Paulin commented on TIKA-1910:
--

bq. This sounds dangerous. Should we set the default LoadErrorHandler to WARN? 
I think Uwe didn't want this behavior on the Solr side...perhaps they could set 
it to IGNORE?

Yes, I mulled this over a bit and even originally committed it as WARN but 
changed it to IGNORE.  I'm fine with switching it back to WARN but I agree 
we'll want a good way to globally turn it to IGNORE.  We have several ways of 
setting this in the code in the ServiceLoader.  I would think most folks would 
expect the setting they put in the config.xml or via the system property to be 
carried forward to the proxy.  Off the top of my head I can't think of a good 
way to get access to that setting without passing the ServiceLoader to the 
classes that will be creating Proxies.  Might need to think about this more.  
Open to suggestions.


> Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies
> -
>
> Key: TIKA-1910
> URL: https://issues.apache.org/jira/browse/TIKA-1910
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Currently the Tika Parser Office Module depends on 
> Tika Parser Web Module
> Tika Parser Package Module
> Tika Parser Text Module
> Using the proxies we can make those dependencies optional so if they are not 
> included on the classpath the code functions but performs no operation on 
> content that would be parsed on the optional dependencies.





[jira] [Resolved] (TIKA-1910) Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies

2016-03-26 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-1910.
--
Resolution: Fixed

> Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies
> -
>
> Key: TIKA-1910
> URL: https://issues.apache.org/jira/browse/TIKA-1910
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Currently the Tika Parser Office Module depends on 
> Tika Parser Web Module
> Tika Parser Package Module
> Tika Parser Text Module
> Using the proxies we can make those dependencies optional so if they are not 
> included on the classpath the code functions but performs no operation on 
> content that would be parsed on the optional dependencies.





[jira] [Created] (TIKA-1910) Tika 2.0 - Decouple Tika Parser Office Module from Other Dependencies

2016-03-25 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-1910:


 Summary: Tika 2.0 - Decouple Tika Parser Office Module from Other 
Dependencies
 Key: TIKA-1910
 URL: https://issues.apache.org/jira/browse/TIKA-1910
 Project: Tika
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin


Currently the Tika Parser Office Module depends on 
Tika Parser Web Module
Tika Parser Package Module
Tika Parser Text Module

Using the proxies we can make those dependencies optional so if they are not 
included on the classpath the code functions but performs no operation on 
content that would be parsed on the optional dependencies.
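
To illustrate the idea, a minimal sketch of a proxy that turns into a no-op 
when its target class is missing.  MiniParser, UpperCaseParser and 
MiniParserProxy are hypothetical names with a deliberately tiny contract; this 
is not the actual ParserProxy in the 2.0 branch.

```java
// Hypothetical minimal parser contract; the real Tika Parser interface is richer.
interface MiniParser {
    void parse(String content, StringBuilder out);
}

// Stands in for a parser that lives in an optional module.
final class UpperCaseParser implements MiniParser {
    public void parse(String content, StringBuilder out) {
        out.append(content.toUpperCase());
    }
}

final class MiniParserProxy implements MiniParser {
    private final MiniParser delegate; // null when the target class is not available

    MiniParserProxy(String className, ClassLoader loader) {
        MiniParser found = null;
        try {
            Class<?> klass = Class.forName(className, true, loader);
            found = (MiniParser) klass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Optional dependency is absent or unusable: stay a no-op.
        }
        this.delegate = found;
    }

    boolean isAvailable() {
        return delegate != null;
    }

    public void parse(String content, StringBuilder out) {
        if (delegate != null) {
            delegate.parse(content, out);
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = MiniParserProxy.class.getClassLoader();

        StringBuilder present = new StringBuilder();
        new MiniParserProxy(UpperCaseParser.class.getName(), loader).parse("chm", present);
        System.out.println("present: '" + present + "'");

        StringBuilder absent = new StringBuilder();
        new MiniParserProxy("no.such.Parser", loader).parse("chm", absent);
        System.out.println("absent:  '" + absent + "'");
    }
}
```

Whether the missing-class case should be silent or logged is a separate 
question; the sketch simply swallows it.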





[jira] [Resolved] (TIKA-1909) Tika 2.0 - Allow Proxy Parser and Detectors to accept Classloaders

2016-03-25 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-1909.
--
Resolution: Fixed

> Tika 2.0 - Allow Proxy Parser and Detectors to accept Classloaders
> --
>
> Key: TIKA-1909
> URL: https://issues.apache.org/jira/browse/TIKA-1909
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Proxy Parsers and Detectors will need to be loaded from specific classloaders 
> in OSGi and multi-classloader environments.  Ensure that classloaders can be 
> passed easily to Proxies.





[jira] [Created] (TIKA-1909) Tika 2.0 - Allow Proxy Parser and Detectors to accept Classloaders

2016-03-25 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-1909:


 Summary: Tika 2.0 - Allow Proxy Parser and Detectors to accept 
Classloaders
 Key: TIKA-1909
 URL: https://issues.apache.org/jira/browse/TIKA-1909
 Project: Tika
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin


Proxy Parsers and Detectors will need to be loaded from specific classloaders 
in OSGi and multi-classloader environments.  Ensure that classloaders can be 
passed easily to Proxies.





[jira] [Resolved] (TIKA-1905) Fix JavaDoc Failures on Java 8

2016-03-19 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-1905.
--
Resolution: Fixed

> Fix JavaDoc Failures on Java 8
> --
>
> Key: TIKA-1905
> URL: https://issues.apache.org/jira/browse/TIKA-1905
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.0, 1.13
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> JavaDoc fails on Java 8.  See 
> http://stackoverflow.com/questions/15886209/maven-is-not-working-in-java-8-when-javadoc-tags-are-incomplete





[jira] [Created] (TIKA-1905) Fix JavaDoc Failures on Java 8

2016-03-19 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-1905:


 Summary: Fix JavaDoc Failures on Java 8
 Key: TIKA-1905
 URL: https://issues.apache.org/jira/browse/TIKA-1905
 Project: Tika
  Issue Type: Bug
Affects Versions: 2.0, 1.13
Reporter: Bob Paulin
Assignee: Bob Paulin


JavaDoc fails on Java 8.  See 
http://stackoverflow.com/questions/15886209/maven-is-not-working-in-java-8-when-javadoc-tags-are-incomplete





[jira] [Updated] (TIKA-1904) Tika 2.0 - Create Proxy Parser and Detectors

2016-03-19 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin updated TIKA-1904:
-
Description: 
There are several parsers and detectors that instantiate parsers and detectors 
that live in different modules in tika 2.0.  As of now these modules are 
dependent on other modules; this includes:
tika-parser-office-module -> tika-parser-web-module, tika-parser-text-module, 
tika-parser-package-module
tika-parser-ebook-module -> tika-parser-text-module
tika-parser-journal-module -> tika-parser-pdf-module

Many of these dependencies could be made optional by introducing the concept of 
proxy parsers and detectors that would enable functionality if all the 
dependencies are included in the project but not throw a ClassNotFoundException 
if the dependent module was not included (e.g. the parse function would do 
nothing).

Example: currently, in ChmParser
{code}
private void parsePage(byte[] byteObject, ContentHandler xhtml)
        throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    HtmlParser htmlParser = new HtmlParser();
    ContentHandler handler =
            new EmbeddedContentHandler(new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}

Instead, the HtmlParser could be proxied in the constructor:
{code}
private final Parser htmlProxyParser;

public ChmParser() {
    this.htmlProxyParser =
            new ParserProxy("org.apache.tika.parser.html.HtmlParser");
}
{code}

And parsePage would then call the proxy:

{code}

private void parsePage(byte[] byteObject, ContentHandler xhtml)
        throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    ContentHandler handler = new EmbeddedContentHandler(
            new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlProxyParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}



  was:
There are several parsers and detectors that instantiate parsers and detectors 
that live in different modules in Tika 2.0.  As of now, these modules are 
dependent on other modules.  This includes:
tika-parser-office-module -> tika-parser-web-module, tika-parser-text-module, 
tika-parser-package-module
tika-parser-ebook-module -> tika-parser-text-module
tika-parser-journal-module -> tika-parser-pdf-module

Many of these dependencies could be made optional by introducing the concept of 
proxy parsers and detectors that would enable functionality if all the 
dependencies are included in the project, but would not throw a 
ClassNotFoundException if the dependent module was not included (e.g. the parse 
function would do nothing).

Example: currently, in ChmParser:
{code}
private void parsePage(byte[] byteObject, ContentHandler xhtml)
        throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    HtmlParser htmlParser = new HtmlParser();
    ContentHandler handler = new EmbeddedContentHandler(
            new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}

Instead, the HtmlParser could be proxied in the constructor:
{code}
private final Parser htmlProxyParser;

public ChmParser() {
    this.htmlProxyParser =
            new ProxyParser("org.apache.tika.parser.html.HtmlParser");
}
{code}

And parsePage would then call the proxy:

{code}

private void parsePage(byte[] byteObject, ContentHandler xhtml)
        throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    ContentHandler handler = new EmbeddedContentHandler(
            new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlProxyParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}




> Tika 2.0 - Create Proxy 

[jira] [Created] (TIKA-1904) Tika 2.0 - Create Proxy Parser and Detectors

2016-03-19 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-1904:


 Summary: Tika 2.0 - Create Proxy Parser and Detectors
 Key: TIKA-1904
 URL: https://issues.apache.org/jira/browse/TIKA-1904
 Project: Tika
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin


There are several parsers and detectors that instantiate parsers and detectors 
that live in different modules in Tika 2.0.  As of now, these modules are 
dependent on other modules.  This includes:
tika-parser-office-module -> tika-parser-web-module, tika-parser-text-module, 
tika-parser-package-module
tika-parser-ebook-module -> tika-parser-text-module
tika-parser-journal-module -> tika-parser-pdf-module

Many of these dependencies could be made optional by introducing the concept of 
proxy parsers and detectors that would enable functionality if all the 
dependencies are included in the project, but would not throw a 
ClassNotFoundException if the dependent module was not included (e.g. the parse 
function would do nothing).

Example: currently, in ChmParser:
{code}
private void parsePage(byte[] byteObject, ContentHandler xhtml)
        throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    HtmlParser htmlParser = new HtmlParser();
    ContentHandler handler = new EmbeddedContentHandler(
            new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}

Instead, the HtmlParser could be proxied in the constructor:
{code}
private final Parser htmlProxyParser;

public ChmParser() {
    this.htmlProxyParser =
            new ProxyParser("org.apache.tika.parser.html.HtmlParser");
}
{code}

And parsePage would then call the proxy:

{code}

private void parsePage(byte[] byteObject, ContentHandler xhtml)
        throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    ContentHandler handler = new EmbeddedContentHandler(
            new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlProxyParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}
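
A rough sketch of how such a proxy might behave, assuming a stand-in interface 
rather than Tika's real Parser API.  MiniParser, UpperCaseParser and 
OptionalParserProxy are hypothetical names used only for illustration; the 
proxy class proposed in this issue may well look different.  The idea shown is 
just: look the class up by name, delegate if it is present, and do nothing if 
it is not.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class ProxySketch {

    /** Hypothetical stand-in for a parser interface, reduced to one method. */
    interface MiniParser {
        void parse(InputStream stream, StringBuilder out) throws IOException;
    }

    /** Example delegate; plays the role of a parser from an optional module. */
    static class UpperCaseParser implements MiniParser {
        @Override
        public void parse(InputStream stream, StringBuilder out) throws IOException {
            int b;
            while ((b = stream.read()) != -1) {
                out.append(Character.toUpperCase((char) b));
            }
        }
    }

    /** Delegates to the named class if it can be loaded; otherwise parse() is a no-op. */
    static final class OptionalParserProxy implements MiniParser {
        private final MiniParser delegate;

        OptionalParserProxy(String className) {
            MiniParser found = null;
            try {
                Class<?> clazz = Class.forName(className);
                found = (MiniParser) clazz.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException | ClassCastException | LinkageError e) {
                // Class missing or unusable: leave the delegate unset and stay silent.
            }
            this.delegate = found;
        }

        boolean isAvailable() {
            return delegate != null;
        }

        @Override
        public void parse(InputStream stream, StringBuilder out) throws IOException {
            if (delegate != null) {
                delegate.parse(stream, out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        MiniParser present = new OptionalParserProxy(UpperCaseParser.class.getName());
        // "com.example.NotInstalledParser" is a made-up name that should not resolve.
        MiniParser missing = new OptionalParserProxy("com.example.NotInstalledParser");

        StringBuilder out = new StringBuilder();
        present.parse(new ByteArrayInputStream("chm page".getBytes(StandardCharsets.US_ASCII)), out);
        missing.parse(new ByteArrayInputStream("ignored".getBytes(StandardCharsets.US_ASCII)), out);
        System.out.println(out);
    }
}
```

Resolving the class once in the constructor keeps the per-call cost to a null 
check; whether a silent no-op or a logged warning is the better fallback is a 
design choice this sketch does not try to settle.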







[jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules

2016-03-19 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201898#comment-15201898
 ] 

Bob Paulin commented on TIKA-1855:
--

1. +1
2. +1
3. +0 I think this could work, but could we use tika-app for this?  Both would 
work, but wouldn't it be fun to flat out remove a project if we could?
4. +1
5. +1 So we'll be using getResourceAsStream to access the files, correct?

> TIka 2.0 - Move shared test-code back to tika-core and distribute test files 
> to parser modules
> --
>
> Key: TIKA-1855
> URL: https://issues.apache.org/jira/browse/TIKA-1855
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Tim Allison
>
> Undo TIKA-1851, and divide test docs to appropriate parser modules.





[jira] [Commented] (TIKA-1894) Add XMPMM metadata extraction to JempboxExtractor

2016-03-15 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195282#comment-15195282
 ] 

Bob Paulin commented on TIKA-1894:
--

I think that sounds like a good idea.

> Add XMPMM metadata extraction to JempboxExtractor
> -
>
> Key: TIKA-1894
> URL: https://issues.apache.org/jira/browse/TIKA-1894
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Minor
>
> The XMP Media Management (XMPMM) section of xmp carries some useful 
> information.  We currently have keys for many of the important attributes in 
> tika-core's o.a.t.metadata.XMPMM, and JempBox extracts the XMPMM schema, but 
> the wiring between the two has not yet been installed. 





[jira] [Resolved] (TIKA-1900) Java 9 ThreadPoolExecutor Requires Max to be set first

2016-03-12 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin resolved TIKA-1900.
--
Resolution: Fixed

> Java 9 ThreadPoolExecutor Requires Max to be set first 
> ---
>
> Key: TIKA-1900
> URL: https://issues.apache.org/jira/browse/TIKA-1900
> Project: Tika
>  Issue Type: Bug
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> In Java 9, setCorePoolSize checks the maximum pool size to ensure that 
> core <= max.  Otherwise it throws an IllegalArgumentException.





[jira] [Assigned] (TIKA-1900) Java 9 ThreadPoolExecutor Requires Max to be set first

2016-03-12 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin reassigned TIKA-1900:


Assignee: Bob Paulin

> Java 9 ThreadPoolExecutor Requires Max to be set first 
> ---
>
> Key: TIKA-1900
> URL: https://issues.apache.org/jira/browse/TIKA-1900
> Project: Tika
>  Issue Type: Bug
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> In Java 9, setCorePoolSize checks the maximum pool size to ensure that 
> core <= max.  Otherwise it throws an IllegalArgumentException.





[jira] [Created] (TIKA-1900) Java 9 ThreadPoolExecutor Requires Max to be set first

2016-03-12 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-1900:


 Summary: Java 9 ThreadPoolExecutor Requires Max to be set first 
 Key: TIKA-1900
 URL: https://issues.apache.org/jira/browse/TIKA-1900
 Project: Tika
  Issue Type: Bug
Reporter: Bob Paulin


In Java 9, setCorePoolSize checks the maximum pool size to ensure that 
core <= max.  Otherwise it throws an IllegalArgumentException.
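
A minimal sketch of an ordering that avoids the exception, assuming the pool 
is being resized after construction.  PoolResizeSketch and resize are 
hypothetical names, not code from Tika; the only point illustrated is that the 
maximum is raised before the core size when growing, and the core size is 
lowered before the maximum when shrinking.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolResizeSketch {

    /**
     * Resizes the pool so that core never exceeds max at any intermediate step.
     * Assumes 0 <= core <= max and max > 0.
     */
    static void resize(ThreadPoolExecutor pool, int core, int max) {
        if (core > pool.getMaximumPoolSize()) {
            // Growing past the current ceiling: raise the maximum first.
            pool.setMaximumPoolSize(max);
            pool.setCorePoolSize(core);
        } else {
            // New core fits under the current ceiling: set core first,
            // so a lower maximum is never below the core size.
            pool.setCorePoolSize(core);
            pool.setMaximumPoolSize(max);
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        resize(pool, 4, 8);   // grow
        resize(pool, 2, 2);   // shrink
        System.out.println(pool.getCorePoolSize() + "/" + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}
```

Calling setCorePoolSize(4) directly on the initial 1/1 pool is the case that 
this issue describes as failing on Java 9.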





[jira] [Commented] (TIKA-1894) Add XMPMM metadata extraction to JempboxExtractor

2016-03-10 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190439#comment-15190439
 ] 

Bob Paulin commented on TIKA-1894:
--

[~talli...@mitre.org] So after looking at this, I'm thinking a new module might 
be overkill here.  There are no parsers in it, so there's no need for an 
Activator class.  I also see a number of the image classes instantiating 
objects that do not need to be instantiated.
{code}
new JempboxExtractor(metadata).parse(tis);
{code}

could be
{code}
JempboxExtractor.parse(metadata, tis);
{code}

I feel the pain that there is now shared code between pdf and multimedia.  
Maybe just a simple shared util jar?

> Add XMPMM metadata extraction to JempboxExtractor
> -
>
> Key: TIKA-1894
> URL: https://issues.apache.org/jira/browse/TIKA-1894
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Minor
>
> The XMP Media Management (XMPMM) section of xmp carries some useful 
> information.  We currently have keys for many of the important attributes in 
> tika-core's o.a.t.metadata.XMPMM, and JempBox extracts the XMPMM schema, but 
> the wiring between the two has not yet been installed. 





[jira] [Updated] (TIKA-1860) Tika 2.0 - Create Module OSGi implementations to replace tika-bundle

2016-03-06 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin updated TIKA-1860:
-
Affects Version/s: 2.0

> Tika 2.0 - Create Module OSGi implementations to replace tika-bundle
> 
>
> Key: TIKA-1860
> URL: https://issues.apache.org/jira/browse/TIKA-1860
> Project: Tika
>  Issue Type: Sub-task
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Create a replacement for the OSGi tika-bundle project out of the new 
> tika-parser-* modules




