[jira] [Commented] (TIKA-2743) Replace com.sun.xml.bind:jaxb-impl and jaxb-core by org.glassfish.jaxb:jaxb-runtime and jaxb-core
[ https://issues.apache.org/jira/browse/TIKA-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681602#comment-16681602 ] Uwe Schindler commented on TIKA-2743:

bq. Tim Allison: shouldn't jaxb-runtime have runtime, rather than compile scope?

If we don't need runtime details, yes. But weren't we talking about a direct dependency on the "com.sun" classes, which are now in the glassfish namespace? If we require those at compile time, it must be a compile dependency.

bq. License should work? CDDL 1.1

The CDDL license is fine. But the license and copyright must be mentioned in the NOTICE file! See the Apache License guidelines.

> Replace com.sun.xml.bind:jaxb-impl and jaxb-core by
> org.glassfish.jaxb:jaxb-runtime and jaxb-core
> --------------------------------------------------
>
> Key: TIKA-2743
> URL: https://issues.apache.org/jira/browse/TIKA-2743
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.19
> Reporter: Thomas Mortagne
> Assignee: Tim Allison
> Priority: Major
> Fix For: 2.0.0, 1.19.1
>
> com.sun.xml.bind:* is actually the old name and is currently a repackaging of
> org.glassfish.jaxb:*. Probably kept for backward compatibility.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
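If the glassfish classes turn out not to be referenced at compile time, the dependency could be declared with runtime scope. A hypothetical pom.xml fragment (version number is only an example from the 2.x line):

```xml
<!-- hypothetical fragment: the scope depends on whether com.sun/glassfish
     classes are imported directly in the source -->
<dependency>
  <groupId>org.glassfish.jaxb</groupId>
  <artifactId>jaxb-runtime</artifactId>
  <version>2.3.1</version>
  <!-- use the default compile scope instead if classes are referenced at compile time -->
  <scope>runtime</scope>
</dependency>
```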
[jira] [Commented] (TIKA-2722) Don't call Date.toString (Possible issue with JDK 11)
[ https://issues.apache.org/jira/browse/TIKA-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604643#comment-16604643 ] Uwe Schindler commented on TIKA-2722:

bq. I reported it to Oracle using their normal channel for reporting bugs.

Once you get the internal ID, send it to Rory; that helps to speed things up, especially as this is shortly before the release. IMHO that's a real bug and should be fixed before release! Not sure about their priority internals :-)

> Don't call Date.toString (Possible issue with JDK 11)
> -----------------------------------------------------
>
> Key: TIKA-2722
> URL: https://issues.apache.org/jira/browse/TIKA-2722
> Project: Tika
> Issue Type: Bug
> Environment: Tika 1.18, JDK 11 with locale set to "ar-EG".
> Reporter: David Smiley
> Priority: Major
>
> I'm troubleshooting [a test failure in Apache
> Lucene/Solr|https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/22799/]
> "extracting" contrib that occurs in JDK 11 with locale "ar-EG". JDK 8 & 9
> pass; I don't know about JDK 10. It has to do with extracting date metadata
> from a PDF, particularly the created date but perhaps others too.
> I stepped through the code into Tika and I think I've found where the
> troublesome code is. First note PDFParser line 271: {{addMetadata(metadata,
> "created", info.getCreationDate());}}. That addMetadata overload variant
> will call toString on a Date. IMO that's asking for trouble since the output
> of that is Locale-dependent. I think that's okay to show to a user but not
> for machine-to-machine information exchange. In the case of the test, it
> yielded this odd-looking date string:
> Thu Nov 13 18:35:51 GMT+٠٥:٠٠ 2008
> I pasted that in and it looks consistent with what I see in IntelliJ and in
> Jenkins logs; hopefully it will post correctly to JIRA. The odd part is the
> hours & minutes relative to GMT. I won't be certain until after I click
> "Create".
> Perhaps this problem is also indicative of a JDK 11 bug? Nevertheless I
> think Tika should avoid calling Date.toString().

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
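The point about locale-dependent output can be sketched like this (hypothetical class name and timestamp, not Tika code): Date.toString() varies with the default locale and timezone, while formatting via Instant yields locale-independent ISO-8601 suitable for machine-to-machine metadata exchange.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.util.Date;

public class IsoDateDemo {
    public static void main(String[] args) {
        Date created = Date.from(Instant.parse("2008-11-13T13:35:51Z"));
        // Locale- and zone-dependent: digits and timezone rendering can vary
        System.out.println(created);
        // Locale-independent ISO-8601: always the same string
        System.out.println(DateTimeFormatter.ISO_INSTANT.format(created.toInstant()));
        // prints 2008-11-13T13:35:51Z
    }
}
```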
[jira] [Commented] (TIKA-2722) Don't call Date.toString (Possible issue with JDK 11)
[ https://issues.apache.org/jira/browse/TIKA-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604641#comment-16604641 ] Uwe Schindler commented on TIKA-2722:

Cool, thanks for the reproducer. That's indeed a bug, as you explicitly set the locale on the call to {{getDisplayName()}}. It still uses the default timezone to return the value. BUG!

> Don't call Date.toString (Possible issue with JDK 11)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
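A minimal sketch of such a reproducer (an assumed setup mirroring the ar-EG environment from the report, not the original test code): the locale is passed explicitly, so the GMT offset digits should be ASCII regardless of the default locale.

```java
import java.util.Locale;
import java.util.TimeZone;

public class TzDisplayNameDemo {
    public static void main(String[] args) {
        // Default locale uses non-ASCII (Arabic-Indic) digits
        Locale.setDefault(Locale.forLanguageTag("ar-EG"));
        TimeZone tz = TimeZone.getTimeZone("GMT+05:00");
        // Locale is given explicitly, so the offset should render with ASCII digits
        String name = tz.getDisplayName(false, TimeZone.SHORT, Locale.ROOT);
        System.out.println(name);
    }
}
```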
[jira] [Commented] (TIKA-2722) Don't call Date.toString (Possible issue with JDK 11)
[ https://issues.apache.org/jira/browse/TIKA-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604516#comment-16604516 ] Uwe Schindler commented on TIKA-2722:

[~dsmiley]: I think this is a bug in Java 11. I know there were some changes to the formatting of time zones. According to their docs, time zones are now printed according to the selected locale, or the default one if none is given. This is fine in most cases, but it seems to affect locales where the digits are different (non-ASCII). Previously, time zones that have no name (numeric only) seem to have been printed with ASCII digits. Nevertheless, only the timezone is printed with locale-dependent digits, not the date itself (reason: no date formatter is used; toString just concatenates integers to format the date, for compatibility reasons).

Did you send Rory O'Donnell a note? He can speed up assigning the JDK issue ID.

IMHO: TIKA should stop using java.util.Date and should go for the java.time APIs, maybe starting with Instant instead of Date.

> Don't call Date.toString (Possible issue with JDK 11)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Comment Edited] (TIKA-2667) Upgrade jmatio to 1.4
[ https://issues.apache.org/jira/browse/TIKA-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518490#comment-16518490 ] Uwe Schindler edited comment on TIKA-2667 at 6/20/18 7:04 PM:

It's OK because it won't fail, but I don't understand the need to catch Throwable and the reason to use AtomicReference. The doPrivileged part cannot throw any exception; it will always succeed, all exceptions are handled internally!

doPrivileged is not risky, as it does not do something like "sudo" (the name of the method is misleading). It just executes the stuff inside the lambda with the privileges of the current code base (and that's documented to be always possible, because we call the method ourselves). Without the doPrivileged it would call the stuff with the caller's privileges. The doPrivileged call is there to allow users of the JAR to configure the JVM so that only our JAR file can do the privileged action. This improves security, because you don't need to give everyone the permission to call setAccessible() and access Unsafe. It's only important to NOT give the MethodHandle to untrusted code, so it must be "private final".

So just copy-paste the whole code from Lucene's MMapDirectory - the static initializer (maybe code the error reporting a bit differently; Lucene uses no logging): https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L312-L336

The return value of the method reference to the private unmapper method is Object, to allow passing either a String or a MethodHandle through the return value of the privileged block. The code you added using the AtomicReference is not needed - exactly because the privileged code returns either a method handle OR an error message (that's the trick). The resourceDescription is used to make the exception more meaningful (in Lucene we use the filename, so the user gets an error about which file handle caused the issue).

This try-catch in the code is obsolete: https://github.com/tballison/jmatio/blob/master/src/main/java/com/jmatio/io/MatFileReader.java#L376-L395 The BufferCleaner interface just throws an IOException if unmapping goes wrong - with a meaningful error message. So I'd remove the try-catch block, it's legacy.

Maybe I should create a Pull Request? Unfortunately I have no time and no checkout of the matfile reader ready.

> Upgrade jmatio to 1.4
> ---------------------
>
> Key: TIKA-2667
> URL: https://issues.apache.org/jira/browse/TIKA-2667
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>
> jmatio 1.3 includes an upgrade to clean MappedByteBuffers in Java 8->11-ea,
> thanks to a copy/paste from Lucene.
>
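The "MethodHandle OR error message" trick can be condensed into a sketch like the following (an illustration of the pattern under discussion, not the verbatim Lucene or jmatio code; class and field names are made up). The privileged block never throws: it returns either a MethodHandle on success or a String error message on failure, so no AtomicReference and no catch of Throwable is needed in the initializer.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.security.AccessController;
import java.security.PrivilegedAction;

final class UnmapSupport {
    // In real code these should be private final, so the handle
    // cannot leak to untrusted code:
    static final MethodHandle UNMAPPER;       // non-null when unmapping works
    static final String NOT_SUPPORTED_REASON; // non-null when it is disabled

    static {
        // doPrivileged wraps only the setup; the lambda returns Object so it
        // can carry either a MethodHandle or a String error message
        Object hack = AccessController.doPrivileged(
                (PrivilegedAction<Object>) UnmapSupport::unmapHackImpl);
        if (hack instanceof MethodHandle) {
            UNMAPPER = (MethodHandle) hack;
            NOT_SUPPORTED_REASON = null;
        } else {
            UNMAPPER = null;
            NOT_SUPPORTED_REASON = hack.toString();
        }
    }

    private static Object unmapHackImpl() {
        try {
            // Java 9+: sun.misc.Unsafe.invokeCleaner(ByteBuffer)
            Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
            MethodHandle cleaner = MethodHandles.lookup().findVirtual(
                    unsafeClass, "invokeCleaner",
                    MethodType.methodType(void.class, ByteBuffer.class));
            Field f = unsafeClass.getDeclaredField("theUnsafe");
            f.setAccessible(true); // done early, inside doPrivileged
            return cleaner.bindTo(f.get(null));
        } catch (SecurityException se) {
            return "Unmapping not supported: security policy denies setAccessible(): " + se;
        } catch (ReflectiveOperationException | RuntimeException e) {
            return "Unmapping is not supported on this platform: " + e;
        }
    }
}
```

Because every failure path returns an error String instead of throwing, the class always loads, even on a JDK where the reflective hack no longer works.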
[jira] [Commented] (TIKA-2667) Upgrade jmatio to 1.4
[ https://issues.apache.org/jira/browse/TIKA-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518490#comment-16518490 ] Uwe Schindler commented on TIKA-2667:

It's OK because it won't fail, but I don't understand the need to catch Throwable and the reason to use AtomicReference. The doPrivileged part cannot throw any exception; it will always succeed, all exceptions are handled internally!

doPrivileged is not risky, as it does not do something like "sudo" (the name of the method is misleading). It just executes the stuff inside the lambda with the privileges of the current code base (and that's always possible, because we call the method). Without the doPrivileged it would call the stuff with the caller's privileges. The doPrivileged call is there to allow users of the JAR to configure the JVM so that only our JAR file can do the privileged action. This improves security, because you don't need to give everyone the permission to call setAccessible() and access Unsafe.

So just copy-paste the whole code from Lucene's MMapDirectory - the static initializer (maybe code the error reporting a bit differently; Lucene uses no logging): https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L312-L336

The return value of the method reference to the private unmapper method is Object, to allow passing either a String or a MethodHandle through the return value of the privileged block. The code you added using the AtomicReference is not needed - exactly because the privileged code returns either a method handle OR an error message (that's the trick). The resourceDescription is used to make the exception more meaningful (in Lucene we use the filename, so the user gets an error about which file handle caused the issue).

This code is obsolete: https://github.com/tballison/jmatio/blob/master/src/main/java/com/jmatio/io/MatFileReader.java#L376-L395 The BufferCleaner interface just throws an IOException if unmapping goes wrong - with a meaningful error message. So I'd remove the try-catch block, it's legacy.

Maybe I should create a Pull Request? Unfortunately I have no time and no checkout of the matfile reader ready.

> Upgrade jmatio to 1.4
> ---------------------
>
> Key: TIKA-2667
> URL: https://issues.apache.org/jira/browse/TIKA-2667
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>
> jmatio 1.3 includes an upgrade to clean MappedByteBuffers in Java 8->11-ea,
> thanks to a copy/paste from Lucene.
> jmatio 1.4 will include one that actually works. Thank you, [~thetaphi]!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
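The shape of the BufferCleaner interface described here could look like the following sketch (an illustration, not the actual jmatio source; names are assumptions). Failure is reported through an IOException carrying the resource description, so the call site needs no extra try-catch around the unmap call.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

// Failure is signaled via IOException with a meaningful message,
// so callers do not wrap the call in their own try-catch.
interface BufferCleaner {
    void freeBuffer(String resourceDescription, ByteBuffer buffer) throws IOException;
}

public class BufferCleanerDemo {
    public static void main(String[] args) throws IOException {
        // A no-op stand-in; a real implementation would invoke the unmapper MethodHandle
        BufferCleaner cleaner = (desc, buf) -> {
            if (buf == null) {
                throw new IOException("Unable to unmap " + desc);
            }
        };
        cleaner.freeBuffer("file.mat", ByteBuffer.allocate(4)); // succeeds silently
        System.out.println("unmapped");
    }
}
```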
[jira] [Commented] (TIKA-2667) Upgrade jmatio to 1.3
[ https://issues.apache.org/jira/browse/TIKA-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512016#comment-16512016 ] Uwe Schindler commented on TIKA-2667:

Hi, I just looked at your code change in jmatio. The AccessController.doPrivileged() needs to be around the static initializer because the setAccessible(true) is now done there (early). When calling the "compiled" cleaner, it can then be sure that it works.

Do you think it is a good idea to throw a runtime exception in the initializer if it fails? This is too risky; what happens if somebody uses a too-new JDK? At the place where it actually calls the created cleaner instance, no doPrivileged is needed (it's already in the implementation, so it is done twice).

Should I open a bug on your fork?

Uwe

> Upgrade jmatio to 1.3
> ---------------------
>
> Key: TIKA-2667
> URL: https://issues.apache.org/jira/browse/TIKA-2667
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>
> jmatio 1.3 includes an upgrade to clean MappedByteBuffers in Java 8->11-ea,
> thanks to a copy/paste from Lucene.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096384#comment-15096384 ] Uwe Schindler commented on TIKA-1830:

It would be good to update to 1.8.11 as soon as it is out, because Lucene/Solr is affected by PDFBOX-3155: we are testing Java 9 preview builds, and that failed because of this bug. For now we disabled the tests around TIKA when running with Java 9.

> Upgrade to PDFBox 1.8.11 when available
> ---------------------------------------
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096663#comment-15096663 ] Uwe Schindler commented on TIKA-1830:

bq. Speaking of integration with Solr, would you have a chance/any interest in offering feedback on our initial restructuring of the parser bundles for Tika 2.0 (TIKA-1824)? Or more generally, do you and your Solr colleagues have any wishes for the 2.0 roadmap?

As already stated in the past, we would like to only bundle parsers for text document formats, because images, class files, and the like are not really useful for indexing by default. Users that want those can still add the missing parser bundles, and SPI will do the rest. Currently we have disabled some parsers by removing the JAR files (like asm-all.jar, netcdf.jar), so TIKA's SPI disables them automatically (because of ClassNotFoundException). This was a bit rude, but it worked. The reason for this was partly also some version incompatibilities (ASM was old in TIKA, Lucene needs the newest one), but ASM is not really useful for indexing anyway! In Solr we don't use transitive dependencies in Ivy, so we decide for each JAR file which one gets bundled; therefore we check every release anyway during updates.

> Upgrade to PDFBox 1.8.11 when available
> ---------------------------------------
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096668#comment-15096668 ] Uwe Schindler commented on TIKA-1824:

Hi, as invited on TIKA-1830, here are some comments from Apache Solr:

{quote}
As already stated in the past, we would like to only bundle parsers for text document formats, because images, class files, and the like are not really useful for indexing by default. Users that want those can still add the missing parser bundles, and SPI will do the rest. Currently we have disabled some parsers by removing the JAR files (like asm-all.jar, netcdf.jar), so TIKA's SPI disables them automatically (because of ClassNotFoundException). This was a bit rude, but it worked. The reason for this was partly also some version incompatibilities (ASM was old in TIKA, Lucene needs the newest one), but ASM is not really useful for indexing anyway! In Solr we don't use transitive dependencies in Ivy, so we decide for each JAR file which one gets bundled; therefore we check every release anyway during updates.
{quote}

In addition, it would be a good idea to allow loading the TIKA SPI files in a separate classloader (to isolate the parser classes from others). The reason for this is JAR hell. If TIKA loaded the parsers in its own classloader (optionally, e.g. by configuration), we could place all parsers and their dependencies in a separate lib directory outside Solr's lib folder.

> Tika 2.0 - Create Initial Parser Modules
> ----------------------------------------
>
> Key: TIKA-1824
> URL: https://issues.apache.org/jira/browse/TIKA-1824
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 2.0
> Reporter: Bob Paulin
> Assignee: Bob Paulin
>
> Create initial break down of parser modules.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
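The isolation idea could be sketched like this (an illustration only; Supplier stands in for Tika's Parser SPI, and the JAR list is a placeholder): parser JARs from a dedicated lib directory go into a child classloader, so their transitive dependencies (ASM, netcdf, ...) cannot clash with the host application's versions.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ServiceLoader;
import java.util.function.Supplier;

public class IsolatedSpiDemo {
    public static void main(String[] args) throws Exception {
        // Would point at the JARs in a separate lib directory
        URL[] parserJars = new URL[0];
        try (URLClassLoader parserLoader =
                new URLClassLoader(parserJars, IsolatedSpiDemo.class.getClassLoader())) {
            // ServiceLoader only discovers META-INF/services entries
            // visible to this classloader
            ServiceLoader<Supplier> parsers = ServiceLoader.load(Supplier.class, parserLoader);
            parsers.forEach(p -> System.out.println(p.getClass().getName()));
        }
    }
}
```

Classes loaded through parserLoader see the child JARs first for the parser dependencies, while the host's own classpath stays untouched.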
[jira] [Updated] (TIKA-1758) BatchCommandLineBuilder fails on systems with whitespace in path
[ https://issues.apache.org/jira/browse/TIKA-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1758:

Description:
All tests for the CLI module fail with errors like this:
{noformat}
Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.048 sec <<< FAILURE! - in org.apache.tika.cli.TikaCLIBatchCommandLineTest
testTwoDirsNoFlags(org.apache.tika.cli.TikaCLIBatchCommandLineTest)  Time elapsed: 0.026 sec  <<< ERROR!
java.nio.file.InvalidPathException: Illegal char <"> at index 0: "C:\Users\Uwe Schindler\Projects\TIKA\svn\tika-app\testInput"
        at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182)
        at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153)
        at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
        at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94)
        at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255)
        at java.nio.file.Paths.get(Paths.java:84)
        at org.apache.tika.cli.BatchCommandLineBuilder.translateCommandLine(BatchCommandLineBuilder.java:137)
        at org.apache.tika.cli.BatchCommandLineBuilder.build(BatchCommandLineBuilder.java:51)
        at org.apache.tika.cli.TikaCLIBatchCommandLineTest.testTwoDirsNoFlags(TikaCLIBatchCommandLineTest.java:127)
{noformat}
The reason is that BatchCommandLineBuilder adds quotes for unknown reasons!? If you use ProcessBuilder you don't need that! Not sure what this should do, but the problem is: the first argument (the executable) contains quotes after the method has transformed it, and that breaks the test. I have no idea how to fix this, but the quotes should not be in a String[] command line at all.

> BatchCommandLineBuilder fails on systems with whitespace in path
> ----------------------------------------------------------------
>
> Key: TIKA-1758
> URL: https://issues.apache.org/jira/browse/TIKA-1758
> Project: Tika
> Issue Type: Bug
> Components: cli
> Reporter: Uwe Schindler
>
> All tests for the CLI module fail with errors like this:
> {noformat}
> Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.048 sec <<< FAILURE! - in org.apache.tika.cli.TikaCLIBatchCommandLineTest
> testTwoDirsNoFlags(org.apache.tika.cli.TikaCLIBatchCommandLineTest)  Time elapsed: 0.026 sec  <<< ERROR!
> java.nio.file.InvalidPathException: Illegal char <"> at index 0: "C:\Users\Uwe Schindler\Projects\TIKA\svn\tika-app\testInput"
> at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182)
> at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153)
> at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
> at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94)
> at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255)
> at java.nio.file.Paths.get(Paths.java:84)
> at org.apache.tika.cli.BatchCommandLineBuilder.translateCommandLine(BatchCommandLineBuilder.java:137)
> at org.apache.tika.cli.BatchCommandLineBuilder.build(BatchCommandLineBuilder.java:51)
> at org.apache.tika.cli.TikaCLIBatchCommandLineTest.testTwoDirsNoFlags(TikaCLIBatchCommandLineTest.java:127)
> {noformat}
> The reason is that BatchCommandLineBuilder adds quotes for unknown
[jira] [Commented] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name
[ https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938915#comment-14938915 ] Uwe Schindler commented on TIKA-1757: - The other issue is different, I opened TIKA-1758 > tika-batch tests fail on systems with whitespace or special chars in folder > name > > > Key: TIKA-1757 > URL: https://issues.apache.org/jira/browse/TIKA-1757 > Project: Tika > Issue Type: Bug >Reporter: Uwe Schindler >Assignee: Tim Allison > Attachments: TIKA-1757.patch > > > This is one problem that forbiddenapis des not catch, because the method > affected has valid use cases: {{URL#getFile()}} or {{URL#getPath()}} both > return the URL path, which should never be treated as a file system path (for > file: URLs). This is breaks asap, if the path contains special characters > which may not be part of URL. getFile() and getPath() return the encoded path. > The correct way to transform a file URL to a file is: {{new > File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven > community for Mojos/Plugins. > In fact the affected test should not use a file at all. Instead it should use > {{Class#getResourceAsStream()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name
Uwe Schindler created TIKA-1757: --- Summary: tika-batch tests fail on systems with whitespace or special chars in folder name Key: TIKA-1757 URL: https://issues.apache.org/jira/browse/TIKA-1757 Project: Tika Issue Type: Bug Reporter: Uwe Schindler This is one problem that forbiddenapis does not catch, because the affected method has valid use cases: {{URL#getFile()}} and {{URL#getPath()}} both return the URL path, which should never be treated as a file system path (for file: URLs). This breaks as soon as the path contains special characters that may not be part of a URL: getFile() and getPath() return the encoded path. The correct way to transform a file URL to a file is {{new File(url.toURI())}}. See also the list of "bad stuff" compiled by the Maven community for Mojos/Plugins. In fact the affected test should not use a file at all; instead it should use {{Class#getResourceAsStream()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
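The difference is easy to demonstrate in a few lines of Java (a sketch with a made-up path; the class and variable names are illustrative, not from Tika):

```java
import java.io.File;
import java.net.URL;

public class UrlToFile {
    public static void main(String[] args) throws Exception {
        // a file: URL for a path containing a space (encoded as %20 in the URL)
        URL url = new File("/tmp/My Documents/test.txt").toURI().toURL();

        // BAD: getPath()/getFile() return the still percent-encoded URL path
        String bad = url.getPath();
        System.out.println(bad.contains("%20"));                 // true: "%20" leaks into the "path"

        // GOOD: round-tripping through URI decodes the path correctly
        File good = new File(url.toURI());
        System.out.println(good.getPath().contains("%20"));      // false: the real space is back
    }
}
```

The bad variant only appears to work as long as the path contains no characters that need URL encoding, which is exactly why such bugs surface on machines with whitespace in the user name.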
[jira] [Created] (TIKA-1756) Update forbiddenapis to v2.0
Uwe Schindler created TIKA-1756: --- Summary: Update forbiddenapis to v2.0 Key: TIKA-1756 URL: https://issues.apache.org/jira/browse/TIKA-1756 Project: Tika Issue Type: Improvement Reporter: Uwe Schindler Forbiddenapis 2.0 was released a few hours ago. Apache POI and Lucene already updated, Tika should do this, too. Attached is a patch. {quote} The main new feature is native support for the Gradle build system (minimum requirement is Gradle 2.3). But also Apache Ant and Apache Maven build systems got improved support: Ant can now load signatures from arbitrary resources by using a new XML element that may contain any valid ANT resource, e.g., ivy's cache-filesets or plain URLs. Apache Maven now supports to load signatures files as artifacts from your repository or Maven Central (new signaturesArtifacts Mojo property). Breaking changes: - Update to Java 6 as minimum requirement. - Switch default Maven lifecycle phase to verify. Bug fixes: - Add automatic plugin execution override for M2E. It is no longer needed to add a lifecycle mapping to exclude forbiddenapis to execute inside Eclipse's M2E {quote} The M2E change is nice, because you no longer need the M2E workaround to disable running the plugin in Eclipse manually. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1758) BatchCommandLineBuilder fails on systems with whitespace in path
Uwe Schindler created TIKA-1758: --- Summary: BatchCommandLineBuilder fails on systems with whitespace in path Key: TIKA-1758 URL: https://issues.apache.org/jira/browse/TIKA-1758 Project: Tika Issue Type: Bug Components: cli Reporter: Uwe Schindler All tests for the CLI module fail with errors like this: {noformat} Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.048 sec <<< FAILURE! - in org.apache.tika.cli.TikaCLIBatchCommandLineTest testTwoDirsNoFlags(org.apache.tika.cli.TikaCLIBatchCommandLineTest) Time elapsed: 0.026 sec <<< ERROR! java.nio.file.InvalidPathException: Illegal char <"> at index 0: "C:\Users\Uwe Schindler\Projects\TIKA\svn\tika-app\testInput" at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182) at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153) at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77) at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94) at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255) at java.nio.file.Paths.get(Paths.java:84) at org.apache.tika.cli.BatchCommandLineBuilder.translateCommandLine(BatchCommandLineBuilder.java:137) at org.apache.tika.cli.BatchCommandLineBuilder.build(BatchCommandLineBuilder.java:51) at org.apache.tika.cli.TikaCLIBatchCommandLineTest.testTwoDirsNoFlags(TikaCLIBatchCommandLineTest.java:127) {noformat} The reason is that BatchCommandLineBuilder adds quotes for unknown reasons!? If you use ProcessBuilder you don't need that! Not sure what this code is supposed to do, but the problem is: the first argument (the executable) contains quotes afterwards, which breaks the test. I have no idea how to fix this, but the quotes should not be in a String[] command line at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
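For background, ProcessBuilder hands each String[] element to the child process verbatim; no shell is involved, so arguments must never be pre-quoted. A minimal sketch (buildCommand, the flags, and the paths are hypothetical, not Tika's actual code):

```java
import java.util.Arrays;
import java.util.List;

public class CommandLineSketch {
    /** Builds an argument list; every element is exactly one argument, whitespace included. */
    static List<String> buildCommand(String executable, String inputDir, String outputDir) {
        // Arguments must NOT be wrapped in quote characters: ProcessBuilder does
        // no shell parsing, so the quotes would become part of the path itself.
        return Arrays.asList(executable, "-inputDir", inputDir, "-outputDir", outputDir);
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("java",
                "C:\\Users\\Some User\\testInput",    // whitespace is fine as-is
                "C:\\Users\\Some User\\testOutput");
        System.out.println(cmd);
        // new ProcessBuilder(cmd).start() would receive both paths intact
    }
}
```

Quoting is only needed when a command line is flattened into a single string for a shell; with a String[] the boundaries between arguments are already explicit.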
[jira] [Commented] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name
[ https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938906#comment-14938906 ] Uwe Schindler commented on TIKA-1757: - Please wait with committing there are more tests failing with similar problems: Now tika-app, in this case some unneeded quoting. > tika-batch tests fail on systems with whitespace or special chars in folder > name > > > Key: TIKA-1757 > URL: https://issues.apache.org/jira/browse/TIKA-1757 > Project: Tika > Issue Type: Bug >Reporter: Uwe Schindler >Assignee: Tim Allison > Attachments: TIKA-1757.patch > > > This is one problem that forbiddenapis des not catch, because the method > affected has valid use cases: {{URL#getFile()}} or {{URL#getPath()}} both > return the URL path, which should never be treated as a file system path (for > file: URLs). This is breaks asap, if the path contains special characters > which may not be part of URL. getFile() and getPath() return the encoded path. > The correct way to transform a file URL to a file is: {{new > File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven > community for Mojos/Plugins. > In fact the affected test should not use a file at all. Instead it should use > {{Class#getResourceAsStream()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1756) Update forbiddenapis to v2.0
[ https://issues.apache.org/jira/browse/TIKA-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1756: Attachment: TIKA-1756.patch > Update forbiddenapis to v2.0 > > > Key: TIKA-1756 > URL: https://issues.apache.org/jira/browse/TIKA-1756 > Project: Tika > Issue Type: Improvement >Reporter: Uwe Schindler > Attachments: TIKA-1756.patch > > > Forbiddenapis 2.0 was released a few hours ago. Apache POI and Lucene already > updated, Tika should do this, too. > Attached is a patch. > {quote} > The main new feature is native support for the Gradle build system (minimum > requirement is Gradle 2.3). But also Apache Ant and Apache Maven build > systems got improved support: Ant can now load signatures from arbitrary > resources by using a new XML element that may > contain any valid ANT resource, e.g., ivy's cache-filesets or plain URLs. > Apache Maven now supports to load signatures files as artifacts from your > repository or Maven Central (new signaturesArtifacts Mojo property). > Breaking changes: > - Update to Java 6 as minimum requirement. > - Switch default Maven lifecycle phase to verify. > Bug fixes: > - Add automatic plugin execution override for M2E. It is no longer needed to > add a lifecycle mapping to exclude forbiddenapis to execute inside Eclipse's > M2E > {quote} > The M2E change is nice, because you no longer need the M2E workaround to > disable running the plugin in Eclipse manually. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1756) Update forbiddenapis to v2.0
[ https://issues.apache.org/jira/browse/TIKA-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14938879#comment-14938879 ] Uwe Schindler commented on TIKA-1756: - While testing this I found out that Tika's tests break when running with whitespace in the folder name (Windows user name with whitespace). But this is unrelated to this issue. The problematic method is one of the crazy things that may be put on the forbidden list (Lucene does this): {{URL#getPath()}} is bad, bad, bad if used to generate a file name. It must be {{new File(url.toURI())}} > Update forbiddenapis to v2.0 > > > Key: TIKA-1756 > URL: https://issues.apache.org/jira/browse/TIKA-1756 > Project: Tika > Issue Type: Improvement >Reporter: Uwe Schindler > Attachments: TIKA-1756.patch > > > Forbiddenapis 2.0 was released a few hours ago. Apache POI and Lucene already > updated, Tika should do this, too. > Attached is a patch. > {quote} > The main new feature is native support for the Gradle build system (minimum > requirement is Gradle 2.3). But also Apache Ant and Apache Maven build > systems got improved support: Ant can now load signatures from arbitrary > resources by using a new XML element that may > contain any valid ANT resource, e.g., ivy's cache-filesets or plain URLs. > Apache Maven now supports to load signatures files as artifacts from your > repository or Maven Central (new signaturesArtifacts Mojo property). > Breaking changes: > - Update to Java 6 as minimum requirement. > - Switch default Maven lifecycle phase to verify. > Bug fixes: > - Add automatic plugin execution override for M2E. It is no longer needed to > add a lifecycle mapping to exclude forbiddenapis to execute inside Eclipse's > M2E > {quote} > The M2E change is nice, because you no longer need the M2E workaround to > disable running the plugin in Eclipse manually. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name
[ https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1757: Attachment: TIKA-1757.patch Patch for broken test. > tika-batch tests fail on systems with whitespace or special chars in folder > name > > > Key: TIKA-1757 > URL: https://issues.apache.org/jira/browse/TIKA-1757 > Project: Tika > Issue Type: Bug >Reporter: Uwe Schindler >Assignee: Tim Allison > Attachments: TIKA-1757.patch > > > This is one problem that forbiddenapis des not catch, because the method > affected has valid use cases: {{URL#getFile()}} or {{URL#getPath()}} both > return the URL path, which should never be treated as a file system path (for > file: URLs). This is breaks asap, if the path contains special characters > which may not be part of URL. getFile() and getPath() return the encoded path. > The correct way to transform a file URL to a file is: {{new > File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven > community for Mojos/Plugins. > In fact the affected test should not use a file at all. Instead it should use > {{Class#getResourceAsStream()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name
[ https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938917#comment-14938917 ] Uwe Schindler commented on TIKA-1757: - bq. If one needs a java.nio.file.Path, Paths.get(url.toURI()) can be used instead. Of course. But in the affected test using a file just to open an InputStream was wrong anyways. So I fixed it by completely removing any File/Path usage. > tika-batch tests fail on systems with whitespace or special chars in folder > name > > > Key: TIKA-1757 > URL: https://issues.apache.org/jira/browse/TIKA-1757 > Project: Tika > Issue Type: Bug >Reporter: Uwe Schindler >Assignee: Tim Allison > Attachments: TIKA-1757.patch > > > This is one problem that forbiddenapis des not catch, because the method > affected has valid use cases: {{URL#getFile()}} or {{URL#getPath()}} both > return the URL path, which should never be treated as a file system path (for > file: URLs). This is breaks asap, if the path contains special characters > which may not be part of URL. getFile() and getPath() return the encoded path. > The correct way to transform a file URL to a file is: {{new > File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven > community for Mojos/Plugins. > In fact the affected test should not use a file at all. Instead it should use > {{Class#getResourceAsStream()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
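The {{Paths.get(url.toURI())}} variant quoted above is the NIO twin of {{new File(url.toURI())}}; a short sketch with an illustrative path (not the actual test resource):

```java
import java.io.File;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

public class UrlToPathSketch {
    public static void main(String[] args) throws Exception {
        // a file: URL whose parent directory name contains a space
        URL url = new File("/tmp/some dir/data.txt").toURI().toURL();

        // Decode via URI; url.getPath() would keep the "%20" in the path
        Path p = Paths.get(url.toURI());
        System.out.println(p.getFileName());               // data.txt
        System.out.println(p.toString().contains("%20"));  // false: the space survived decoding
    }
}
```

As the comment notes, though, the cleanest fix for a test is to avoid file-system paths entirely and read the resource as a stream.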
[jira] [Commented] (TIKA-1714) Consider making default host for Tika Server 0.0.0.0 instead of localhost
[ https://issues.apache.org/jira/browse/TIKA-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701435#comment-14701435 ] Uwe Schindler commented on TIKA-1714: - If you want to bind to all, don't use 0.0.0.0, because this is IPv4 only (won't work with IPv6). To bind to all, remove the whole IP address setting in the socket config. It then binds to IPv4 and also IPv6, depending on availability. Consider making default host for Tika Server 0.0.0.0 instead of localhost - Key: TIKA-1714 URL: https://issues.apache.org/jira/browse/TIKA-1714 Project: Tika Issue Type: Improvement Components: server Affects Versions: 1.10 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.11 I noticed in Tika-Python on Windows while fixing some bugs that by default Tika Server binds to localhost which means that the Tika Server running on Windows isn't available to external hosts trying to access it on host name:9998. I think the default behavior is that it *should* be available externally, meaning, we should probably bind to the special address, 0.0.0.0, which binds to all interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1714) Consider making default host for Tika Server 0.0.0.0 instead of localhost
[ https://issues.apache.org/jira/browse/TIKA-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701442#comment-14701442 ] Uwe Schindler commented on TIKA-1714: - In any case, I agree with Nick, we should not do this. Maybe allow binding to all addresses using a system property, or make it configurable. I have a lot of machines with multiple IP addresses, and I want external services to bind only to one specific address, so the default should be ::1 / 127.0.0.1 or any IP address the user passes as a command line option / system property (like {{-Djetty.host=XXX -Djetty.port=XXX}}). The user is then also free to bind to IPv4 and/or IPv6 on his own. Consider making default host for Tika Server 0.0.0.0 instead of localhost - Key: TIKA-1714 URL: https://issues.apache.org/jira/browse/TIKA-1714 Project: Tika Issue Type: Improvement Components: server Affects Versions: 1.10 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.11 I noticed in Tika-Python on Windows while fixing some bugs that by default Tika Server binds to localhost which means that the Tika Server running on Windows isn't available to external hosts trying to access it on host name:9998. I think the default behavior is that it *should* be available externally, meaning, we should probably bind to the special address, 0.0.0.0, which binds to all interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
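The addressing semantics under discussion can be sketched with plain java.net (Tika Server's real configuration goes through its HTTP server setup; this only illustrates loopback vs. wildcard binding):

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class BindSketch {
    public static void main(String[] args) throws Exception {
        // Safe default: bind to loopback only (the ::1 / 127.0.0.1 case)
        try (ServerSocket local = new ServerSocket()) {
            local.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            System.out.println(local.getInetAddress().isLoopbackAddress()); // true
        }

        // Bind to all interfaces: give NO address at all. This yields the
        // wildcard address (IPv4 and IPv6 as available), unlike the literal
        // "0.0.0.0", which is IPv4-only.
        try (ServerSocket any = new ServerSocket()) {
            any.bind(new InetSocketAddress(0));
            System.out.println(any.getInetAddress().isAnyLocalAddress());   // true
        }
    }
}
```

Port 0 is used here only so the sketch never collides with a port already in use.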
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698313#comment-14698313 ] Uwe Schindler commented on TIKA-1706: - Yes, you can add the Maven property {{<failOnUnresolvableSignatures>false</failOnUnresolvableSignatures>}} to the plugin configuration: [http://jenkins.thetaphi.de/job/Forbidden-APIs/javadoc/check-mojo.html#failOnUnresolvableSignatures] An alternative is to enable commons-io-unsafe-2.4 only for those modules where it's used; unfortunately this is not so easy, because you cannot inherit only some array values to submodules, you must reconfigure all bundledSignatures in submodules. Bring back commons-io to tika-core -- Key: TIKA-1706 URL: https://issues.apache.org/jira/browse/TIKA-1706 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Attachments: TIKA-1706.patch TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space. I believe these arguments are weaker nowadays due to the following concerns: - Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it - Since some modules use both tika-core and commons-io, it's not clear which code should be used - Having the inlined classes causes more maintenance and/or technology debt (which in turn causes more maintenance) - Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on. I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697961#comment-14697961 ] Uwe Schindler commented on TIKA-1706: - If you bring in commons-io, you should also add the corresponding forbidden-apis signatures to the POM. commons-io makes it easy to choose the wrong IOUtils/FileUtils method, and then you are dependent on the default charset again... https://github.com/policeman-tools/forbidden-apis/wiki/BundledSignatures Bring back commons-io to tika-core -- Key: TIKA-1706 URL: https://issues.apache.org/jira/browse/TIKA-1706 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Attachments: TIKA-1706.patch TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space. I believe these arguments are weaker nowadays due to the following concerns: - Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it - Since some modules use both tika-core and commons-io, it's not clear which code should be used - Having the inlined classes causes more maintenance and/or technology debt (which in turn causes more maintenance) - Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on. I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
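The class of bug meant here is charset-dependent decoding; a stdlib-only sketch (commons-io's charset-less overloads such as {{IOUtils.toString(InputStream)}} have the same trap as the "bad" line below):

```java
import java.nio.charset.StandardCharsets;

public class CharsetSketch {
    public static void main(String[] args) {
        byte[] utf8 = "h\u00e4user".getBytes(StandardCharsets.UTF_8); // "häuser" as UTF-8 bytes

        // BAD: uses the platform default charset, so the result differs between
        // machines. This is the pattern the forbidden-apis signatures flag.
        String platformDependent = new String(utf8);

        // GOOD: charset named explicitly, same result everywhere.
        String portable = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(portable.equals("h\u00e4user"));  // true
    }
}
```

The forbidden-apis bundled signature for commons-io exists precisely to fail the build when one of the charset-less overloads slips in.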
[jira] [Updated] (TIKA-1705) Update ASM dependency to 5.0.4
[ https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1705: Attachment: TIKA-1705-2.patch Sorry for a second patch. I just noticed that you were using asm-debug-all.jar instead of plain simple asm.jar. As this is a very basic parser, the asm-commons parts or helper visitors are not needed, so we should fallback to plain asm (also for compatibility with other projects). The -debug stuff was previously used because of generics warnings in earlier versions (they stripped off generics from JAR file), but this is no longer an issue. So please apply this patch, too :-) Update ASM dependency to 5.0.4 -- Key: TIKA-1705 URL: https://issues.apache.org/jira/browse/TIKA-1705 Project: Tika Issue Type: Task Affects Versions: 1.7 Reporter: Uwe Schindler Assignee: Dave Meikle Fix For: 1.11 Attachments: TIKA-1705-2.patch, TIKA-1705.patch Currently the Class file parser uses ASM 4.1. This older version cannot read Java 8 / Java 9 class files (fails with Exception). The upgrade to ASM 5.0.4 is very simple, just Maven dependency change. The code change is only to update the visitor version, so it gets new Java 8 features like lambdas reported, but this is not really required, but should be done for full support. FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 5, too. You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no problem with Lucene using a newer version). Since ASM 4.x the updates are more easy (no visitor interfaces anymore, instead abstract classes), so it does not break if you just replace the JAR file. So just see this as a recommendatation, not urgent! Solr/Lucene will also work without this patch (it just replaces the shipped ASM by newer version in our packaging). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (TIKA-1705) Update ASM dependency to 5.0.4
[ https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reopened TIKA-1705: - Reopen for 2nd patch. Update ASM dependency to 5.0.4 -- Key: TIKA-1705 URL: https://issues.apache.org/jira/browse/TIKA-1705 Project: Tika Issue Type: Task Affects Versions: 1.7 Reporter: Uwe Schindler Assignee: Dave Meikle Fix For: 1.11 Attachments: TIKA-1705-2.patch, TIKA-1705.patch Currently the Class file parser uses ASM 4.1. This older version cannot read Java 8 / Java 9 class files (fails with Exception). The upgrade to ASM 5.0.4 is very simple, just Maven dependency change. The code change is only to update the visitor version, so it gets new Java 8 features like lambdas reported, but this is not really required, but should be done for full support. FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 5, too. You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no problem with Lucene using a newer version). Since ASM 4.x the updates are more easy (no visitor interfaces anymore, instead abstract classes), so it does not break if you just replace the JAR file. So just see this as a recommendatation, not urgent! Solr/Lucene will also work without this patch (it just replaces the shipped ASM by newer version in our packaging). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1705) Update ASM dependency to 5.0.4
[ https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681759#comment-14681759 ] Uwe Schindler commented on TIKA-1705: - The question about this: This will not fail tests when new versions of JVMs are out. You will only find that problem when new class files are added In my opinion, a good test would also be to also test a class file from the local JVM (e.g., {{String.class.getResourceAsStream('String.class')}} With that test you would actually make sure that the class files of the JVM that compiles can be read! So once Java 9 is out and has a new classfile format, this would fail build if somebody runs build with this JVM. Update ASM dependency to 5.0.4 -- Key: TIKA-1705 URL: https://issues.apache.org/jira/browse/TIKA-1705 Project: Tika Issue Type: Task Affects Versions: 1.7 Reporter: Uwe Schindler Assignee: Dave Meikle Fix For: 1.11 Attachments: TIKA-1705-2.patch, TIKA-1705.patch Currently the Class file parser uses ASM 4.1. This older version cannot read Java 8 / Java 9 class files (fails with Exception). The upgrade to ASM 5.0.4 is very simple, just Maven dependency change. The code change is only to update the visitor version, so it gets new Java 8 features like lambdas reported, but this is not really required, but should be done for full support. FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 5, too. You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no problem with Lucene using a newer version). Since ASM 4.x the updates are more easy (no visitor interfaces anymore, instead abstract classes), so it does not break if you just replace the JAR file. So just see this as a recommendatation, not urgent! Solr/Lucene will also work without this patch (it just replaces the shipped ASM by newer version in our packaging). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
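The proposed self-test could look roughly like this (a sketch; a real Tika test would feed the bytes into the class file parser rather than just checking the magic number):

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class LocalJvmClassFile {
    /** Reads java/lang/String.class from the JVM that is running the build. */
    static byte[] readStringClass() throws Exception {
        try (InputStream in = String.class.getResourceAsStream("String.class");
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] clazz = readStringClass();
        // every class file starts with the magic number 0xCAFEBABE; the parser
        // under test must accept whatever class file version follows it
        System.out.println(clazz[0] == (byte) 0xCA && clazz[1] == (byte) 0xFE);  // true
    }
}
```

Because the bytes come from the running JVM, the test automatically fails as soon as the build runs on a JDK whose class file format the parser cannot read yet, which is exactly the early warning being asked for.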
[jira] [Created] (TIKA-1705) Update ASM dependency to 5.0.4
Uwe Schindler created TIKA-1705: --- Summary: Update ASM dependency to 5.0.4 Key: TIKA-1705 URL: https://issues.apache.org/jira/browse/TIKA-1705 Project: Tika Issue Type: Task Affects Versions: 1.7 Reporter: Uwe Schindler Currently the Class file parser uses ASM 4.1. This older version cannot read Java 8 / Java 9 class files (fails with an Exception). The upgrade to ASM 5.0.4 is very simple, just a Maven dependency change. The code change is only to update the visitor version, so it gets new Java 8 features like lambdas reported; this is not strictly required, but should be done for full support. FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 5, too. You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no problem with Lucene using a newer version). Since ASM 4.x the updates are easier (no visitor interfaces anymore, instead abstract classes), so it does not break if you just replace the JAR file. So just see this as a recommendation, not urgent! Solr/Lucene will also work without this patch (it just replaces the shipped ASM by a newer version in our packaging). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1705) Update ASM dependency to 5.0.4
[ https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1705: Attachment: TIKA-1705.patch Simple patch. All tests pass. Update ASM dependency to 5.0.4 -- Key: TIKA-1705 URL: https://issues.apache.org/jira/browse/TIKA-1705 Project: Tika Issue Type: Task Affects Versions: 1.7 Reporter: Uwe Schindler Attachments: TIKA-1705.patch Currently the Class file parser uses ASM 4.1. This older version cannot read Java 8 / Java 9 class files (fails with Exception). The upgrade to ASM 5.0.4 is very simple, just Maven dependency change. The code change is only to update the visitor version, so it gets new Java 8 features like lambdas reported, but this is not really required, but should be done for full support. FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 5, too. You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no problem with Lucene using a newer version). Since ASM 4.x the updates are more easy (no visitor interfaces anymore, instead abstract classes), so it does not break if you just replace the JAR file. So just see this as a recommendatation, not urgent! Solr/Lucene will also work without this patch (it just replaces the shipped ASM by newer version in our packaging). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1675) please avoid xmlbeans dependency
[ https://issues.apache.org/jira/browse/TIKA-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617578#comment-14617578 ] Uwe Schindler commented on TIKA-1675: - There was already an issue/discussion open on the POI mailing lists and issue tracker to no longer use xmlbeans & Co., because since Java 6 the JAXB interface is a public API that allows mapping XML documents to Java beans, which is exactly what xmlbeans is doing. Unfortunately this is a larger effort: changing the API to use the standard Java API (which might also bring more performance). This would remove a lot of unneeded XML-based stuff from POI for Microsoft Office 2007+ file formats. -1 to absorb the buggy xmlbeans (this lib was also the problem of the major Solr/Lucene security issue last year) +1 to adopt JAXB instead of xmlbeans please avoid xmlbeans dependency Key: TIKA-1675 URL: https://issues.apache.org/jira/browse/TIKA-1675 Project: Tika Issue Type: Bug Reporter: Robert Muir This dependency (e.g. jar file) is fundamentally broken... XMLBEANS-499 Is there an alternative that could be used? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1675) please avoid xmlbeans dependency
[ https://issues.apache.org/jira/browse/TIKA-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617578#comment-14617578 ] Uwe Schindler edited comment on TIKA-1675 at 7/7/15 10:53 PM: -- There was already an issue/discussion open on POI mailing lists and issue tracker to no longer use xmlbeans Co, because since Java 6 the JAXB interface is a public API that allows to map XML documents to Java Beans (https://jcp.org/en/jsr/detail?id=222) - which is exactly the same as xmlbeans is dooing. Unfortunately this is a larger approach to change the API to do use the standards Java API (and might also bring more performance). This would remove a lot of unneeded XML-based stuff from POI for Microsoft Office 2007+ file formats. -1 to absorb the buggy xmlbeans (this lib was also the problem of the major Solr/Lucene security issue last year) +1 to adopt JAXB instead of xmlbeans was (Author: thetaphi): There was already an issue/discussion open on POI mailing lists and issue tracker to no longer use xmlbeans Co, because since Java 6 the JAXB interface is a public API that allows to map XML documents to Java Beans - which is exactly the same as xmlbeans is dooing. Unfortunately this is a larger approach to change the API to do use the standards Java API (and might also bring more performance). This would remove a lot of unneeded XML-based stuff from POI for Microsoft Office 2007+ file formats. -1 to absorb the buggy xmlbeans (this lib was also the problem of the major Solr/Lucene security issue last year) +1 to adopt JAXB instead of xmlbeans please avoid xmlbeans dependency Key: TIKA-1675 URL: https://issues.apache.org/jira/browse/TIKA-1675 Project: Tika Issue Type: Bug Reporter: Robert Muir This dependency (e.g jar file) is fundamentally broken... XMLBEANS-499 Is there an alternative that could be used? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1675) please avoid xmlbeans dependency
[ https://issues.apache.org/jira/browse/TIKA-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617588#comment-14617588 ] Uwe Schindler commented on TIKA-1675: - kiwiwings kiwiwi...@apache.org already proposed this for POI: [http://apache-poi.1045710.n5.nabble.com/Re-svn-commit-r1682117-poi-site-src-documentation-content-xdocs-document-index-xml-td5718914.html#a5718928] But this is really an issue for Apache POI! please avoid xmlbeans dependency Key: TIKA-1675 URL: https://issues.apache.org/jira/browse/TIKA-1675 Project: Tika Issue Type: Bug Reporter: Robert Muir This dependency (e.g jar file) is fundamentally broken... XMLBEANS-499 Is there an alternative that could be used? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
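To make the JAXB argument above concrete, here is a minimal sketch of mapping an XML document to a Java bean, which is the capability the comments say could replace xmlbeans. The {{Doc}} class and element names are invented for illustration; the sketch assumes the {{javax.xml.bind}} API is available (bundled with the JDK from Java 6 through Java 10; on later JDKs it needs a standalone JAXB implementation such as the glassfish artifacts discussed in TIKA-2743).

```java
import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlRootElement;

// Hypothetical bean for illustration; JAXB maps the <doc> root element
// to this class and the <title> child element to the public field.
@XmlRootElement
class Doc {
    public String title;
}

public class JaxbDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>hello</title></doc>";
        // Unmarshal the XML document straight into a Java bean,
        // the same job xmlbeans does for POI's OOXML schemas.
        Doc d = (Doc) JAXBContext.newInstance(Doc.class)
                .createUnmarshaller()
                .unmarshal(new StringReader(xml));
        System.out.println(d.title);
    }
}
```

The default JAXB access type (PUBLIC_MEMBER) picks up the public field without further annotations, which keeps the bean declaration minimal.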
[jira] [Commented] (TIKA-1637) Oracle internal API jdeps request for information
[ https://issues.apache.org/jira/browse/TIKA-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558087#comment-14558087 ] Uwe Schindler commented on TIKA-1637: - Hi Dave, forbidden-apis already forbids use of internal APIs like sun.misc.Unsafe, see TIKA's parent POM: {{<internalRuntimeForbidden>true</internalRuntimeForbidden>}} But indeed, we don't see usage in dependent libraries, so it would be good to run jdeps on all the millions of dependencies! :-) Oracle internal API jdeps request for information - Key: TIKA-1637 URL: https://issues.apache.org/jira/browse/TIKA-1637 Project: Tika Issue Type: Task Reporter: Dave Meikle Assignee: Dave Meikle Priority: Trivial We have been asked to provide information to Oracle around the internal API usage in Apache Tika to support the move to JDK 9, which contains significant changes. {quote} Hi David, My name is Rory O'Donnell, I am the OpenJDK Quality Group Lead. I'm contacting you because your open source project seems to be a very popular dependency for other open source projects. As part of the preparations for JDK 9, Oracle’s engineers have been analyzing open source projects like yours to understand usage. One area of concern involves identifying compatibility problems, such as reliance on JDK-internal APIs. Our engineers have already prepared guidance on migrating some of the more common usage patterns of JDK-internal APIs to supported public interfaces. The list is on the OpenJDK wiki [0]. As part of the ongoing development of JDK 9, I would like to inquire about your usage of JDK-internal APIs and to encourage migration towards supported Java APIs if necessary. The first step is to identify if your application(s) is leveraging internal APIs. Step 1: Download JDeps. Just download a preview release of JDK8 (JDeps Download). You do not need to actually test or run your application on JDK8. 
JDeps (Docs) looks through JAR files and identifies which JAR files use internal APIs and then lists those APIs. Step 2: To run JDeps against an application. The command looks like: jdk8/bin/jdeps -P -jdkinternals *.jar > your-application.jdeps.txt The output inside your-application.jdeps.txt will look like: your.package (Filename.jar) -> com.sun.corba.se JDK internal API (rt.jar) 3rd party library using Internal APIs: If your analysis uncovers a third-party component that you rely on, you can contact the provider and let them know of the upcoming changes. You can then either work with the provider to get an updated library that won't rely on Internal APIs, or you can find an alternative provider for the capabilities that the offending library provides. Dynamic use of Internal APIs: JDeps cannot detect dynamic use of internal APIs, for example through reflection, service loaders and similar mechanisms. Rgds, Rory [0] https://wiki.openjdk.java.net/display/JDK8/Java+Dependency+Analysis+Tool -- Rgds, Rory O'Donnell Quality Engineering Manager Oracle EMEA, Dublin, Ireland {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1637) Oracle internal API jdeps request for information
[ https://issues.apache.org/jira/browse/TIKA-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558087#comment-14558087 ] Uwe Schindler edited comment on TIKA-1637 at 5/25/15 10:18 AM: --- Hi Dave, forbidden-apis already forbids use of internal APIs like sun.misc.Unsafe, see TIKA's parent POM: {{<internalRuntimeForbidden>true</internalRuntimeForbidden>}} It also forbids deprecated APIs: {{<bundledSignature>jdk-deprecated</bundledSignature>}} This is important because for the first time in Java's lifetime, JDK 9 really removed some deprecated stuff!!! (because this was needed for modularization) But indeed, we don't see usage in dependent libraries, so it would be good to run jdeps on all the millions of dependencies! :-) Oracle internal API jdeps request for information - Key: TIKA-1637 URL: https://issues.apache.org/jira/browse/TIKA-1637 Project: Tika Issue Type: Task Reporter: Dave Meikle Assignee: Dave Meikle Priority: Trivial We have been asked to provide information to Oracle around the internal API usage in Apache Tika to support the move to JDK 9, which contains significant changes. {quote} Hi David, My name is Rory O'Donnell, I am the OpenJDK Quality Group Lead. I'm contacting you because your open source project seems to be a very popular dependency for other open source projects. As part of the preparations for JDK 9, Oracle’s engineers have been analyzing open source projects like yours to understand usage. One area of concern involves identifying compatibility problems, such as reliance on JDK-internal APIs. 
Our engineers have already prepared guidance on migrating some of the more common usage patterns of JDK-internal APIs to supported public interfaces. The list is on the OpenJDK wiki [0]. As part of the ongoing development of JDK 9, I would like to inquire about your usage of JDK-internal APIs and to encourage migration towards supported Java APIs if necessary. The first step is to identify if your application(s) is leveraging internal APIs. Step 1: Download JDeps. Just download a preview release of JDK8 (JDeps Download). You do not need to actually test or run your application on JDK8. JDeps (Docs) looks through JAR files and identifies which JAR files use internal APIs and then lists those APIs. Step 2: To run JDeps against an application. The command looks like: jdk8/bin/jdeps -P -jdkinternals *.jar > your-application.jdeps.txt The output inside your-application.jdeps.txt will look like: your.package (Filename.jar) -> com.sun.corba.se JDK internal API (rt.jar) 3rd party library using Internal APIs: If your analysis uncovers a third-party component that you rely on, you can contact the provider and let them know of the upcoming changes. You can then either work with the provider to get an updated library that won't rely on Internal APIs, or you can find an alternative provider for the capabilities that the offending library provides. Dynamic use of Internal APIs: JDeps cannot detect dynamic use of internal APIs, for example through reflection, service loaders and similar mechanisms. Rgds, Rory [0] https://wiki.openjdk.java.net/display/JDK8/Java+Dependency+Analysis+Tool -- Rgds, Rory O'Donnell Quality Engineering Manager Oracle EMEA, Dublin, Ireland {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
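As a small illustration of the Step 2 workflow quoted above, the following sketch assembles the jdeps invocation programmatically. The jar name is a placeholder, and the actual ProcessBuilder call is left commented out because it assumes a JDK 8+ jdeps binary on the PATH.

```java
import java.util.ArrayList;
import java.util.List;

public class JdepsRunner {
    // Build the jdeps command line described in the letter:
    //   jdeps -P -jdkinternals <jars...>  (output redirected to a report file)
    static List<String> buildCommand(List<String> jars) {
        List<String> cmd = new ArrayList<>();
        cmd.add("jdeps");
        cmd.add("-P");            // -P: filter dependences within the same package
        cmd.add("-jdkinternals"); // report only uses of JDK-internal APIs
        cmd.addAll(jars);
        return cmd;
    }

    public static void main(String[] args) {
        // "tika-core.jar" is just an example argument.
        List<String> cmd = buildCommand(List.of("tika-core.jar"));
        // To actually run it and capture the report (requires jdeps on PATH):
        // new ProcessBuilder(cmd)
        //     .redirectOutput(new java.io.File("your-application.jdeps.txt"))
        //     .start().waitFor();
        System.out.println(String.join(" ", cmd));
    }
}
```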
[jira] [Commented] (TIKA-1628) ExternalParser.check should return false if it hits SecurityException
[ https://issues.apache.org/jira/browse/TIKA-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539838#comment-14539838 ] Uwe Schindler commented on TIKA-1628: - +1 to the patch. I don't think we need a test! ExternalParser.check should return false if it hits SecurityException - Key: TIKA-1628 URL: https://issues.apache.org/jira/browse/TIKA-1628 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.9 Attachments: TIKA-1628.patch If you run Tika with a Java security manager that blocks execution of external processes, ExternalParser.check throws SecurityException, but I think it should just return false? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
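A minimal sketch of the behavior the TIKA-1628 patch aims for: treat a SecurityException thrown while forking a process as "external tool unavailable" and return false, instead of letting it propagate. The method name {{checkAvailable}} is illustrative, not the actual Tika API; the real {{ExternalParser.check}} also inspects the process exit value.

```java
public class ExternalCheck {
    // Returns true only if the command can actually be launched.
    static boolean checkAvailable(String... cmd) {
        try {
            Process p = Runtime.getRuntime().exec(cmd);
            p.destroy();
            return true;
        } catch (SecurityException e) {
            // A security manager forbids forking processes:
            // report "unavailable" instead of failing the whole parse.
            return false;
        } catch (java.io.IOException e) {
            // Command not found or not executable.
            return false;
        }
    }

    public static void main(String[] args) {
        // A command that certainly does not exist, so the IOException path fires.
        System.out.println(checkAvailable("definitely-not-a-real-command-12345"));
    }
}
```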
[jira] [Comment Edited] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525112#comment-14525112 ] Uwe Schindler edited comment on TIKA-1582 at 5/2/15 7:35 AM: - Hi Chris, there is already forbidden-apis 1.8 available! The main new feature here is to allow suppressing forbidden checks in classes/methods/fields using annotations ({{@SuppressForbidden}} or similar, configurable). In the past, we excluded whole class files in Lucene/Elasticsearch (e.g. where we want to write to System.out because it's a command line tool, which is otherwise completely forbidden in Lucene), now we can annotate those methods (see LUCENE-6420). If we also need this functionality in TIKA, too - we can update. Bumping the version number in any case is fine, too (e.g., for Java 9 support)! Uwe Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Assignee: Chris A. 
Mattmann Priority: Trivial Labels: memex Fix For: 1.9 Attachments: nnmodel.docx, week2-report-histogram comparison.docx, week6 report.docx Content-based mime type detection is one of the popular approaches to detect mime type; there are others based on file extension and magic numbers. Currently Tika has implemented 3 approaches in detecting mime types. They are: 1) file extensions 2) magic numbers (the most trustworthy in Tika) 3) content-type (the header in the HTTP response if present and available). Content-based mime type detection, however, analyses the distribution of the entire stream of bytes, finds a similar pattern for the same type and builds a function that is able to group them into one or several classes so as to classify and predict. It is believed this feature might broaden the usage of Tika with a bit more security enforcement for mime type detection. Because we want to build a model that is etched with the patterns it has seen, in some situations we may not trust those types which have not been trained/learned by the model. In some situations, magic numbers embedded in the files can be copied but the actual content could be a potentially detrimental Trojan program. By enforcing the trust on byte frequency patterns, we are able to enhance the security of the detection. The proposed content-based mime detection to be integrated into Tika is based on a machine learning algorithm, i.e. a neural network with back-propagation. The input: 0-255 bins, each of which represents a byte and stores the count of occurrences for that byte; the byte frequency histograms are normalized to fall in the range between 0 and 1, then passed to a companding function to enhance the infrequent bytes. The output of the neural network is a binary decision, 1 or 0. Notice BTW, the proposed feature will be implemented with the GRB file type as one example. 
In this example, we build a model that is able to classify the GRB file type from non-GRB file types; notice the set of non-GRB files is huge and cannot be easily defined, so there need to be as many negative training examples as possible to form this non-GRB decision boundary. The neural network is considered as a two-stage process: training and classification. The training can be done in any programming language; in this feature/research, the training of the neural network is implemented in R and the source can be found in my github repository, i.e. https://github.com/LukeLiush/filetypeDetection; I am also going to post a document that describes the use of the program and the syntax/format of the input and output. After training, we need to export the model and import it to Tika; in Tika, we create a TrainedModelDetector that reads this model file with one or more model parameters or several model files, so it can detect the mime types with the
[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525112#comment-14525112 ] Uwe Schindler commented on TIKA-1582: - Hi Chris, there is already forbidden-apis 1.8 available! The main new feature here is to allow suppressing forbidden checks in classes/methods/fields using annotations ({{@SuppressForbidden}} or similar, configurable). In the past, we excluded whole class files in Lucene/Elasticsearch (e.g. where we want to write to System.out because it's a command line tool, which is otherwise completely forbidden in Lucene), now we can annotate those methods (see LUCENE-6420). If we also need this functionality in TIKA, too - we can update. Bumping the version number in any case is fine, too (e.g., for Java 9 support)! Uwe Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Assignee: Chris A. Mattmann Priority: Trivial Labels: memex Fix For: 1.9 Attachments: nnmodel.docx, week2-report-histogram comparison.docx, week6 report.docx Content-based mime type detection is one of the popular approaches to detect mime type; there are others based on file extension and magic numbers. Currently Tika has implemented 3 approaches in detecting mime types. They are: 1) file extensions 2) magic numbers (the most trustworthy in Tika) 3) content-type (the header in the HTTP response if present and available). Content-based mime type detection, however, analyses the distribution of the entire stream of bytes, finds a similar pattern for the same type and builds a function that is able to group them into one or several classes so as to classify and predict. It is believed this feature might broaden the usage of Tika with a bit more security enforcement for mime type detection. 
Because we want to build a model that is etched with the patterns it has seen, in some situations we may not trust those types which have not been trained/learned by the model. In some situations, magic numbers embedded in the files can be copied but the actual content could be a potentially detrimental Trojan program. By enforcing the trust on byte frequency patterns, we are able to enhance the security of the detection. The proposed content-based mime detection to be integrated into Tika is based on a machine learning algorithm, i.e. a neural network with back-propagation. The input: 0-255 bins, each of which represents a byte and stores the count of occurrences for that byte; the byte frequency histograms are normalized to fall in the range between 0 and 1, then passed to a companding function to enhance the infrequent bytes. The output of the neural network is a binary decision, 1 or 0. Notice BTW, the proposed feature will be implemented with the GRB file type as one example. In this example, we build a model that is able to classify the GRB file type from non-GRB file types; notice the set of non-GRB files is huge and cannot be easily defined, so there need to be as many negative training examples as possible to form this non-GRB decision boundary. The neural network is considered as a two-stage process: training and classification. The training can be done in any programming language; in this feature/research, the training of the neural network is implemented in R and the source can be found in my github repository, i.e. https://github.com/LukeLiush/filetypeDetection; I am also going to post a document that describes the use of the program and the syntax/format of the input and output. 
After training, we need to export the model and import it to Tika; in Tika, we create a TrainedModelDetector that reads this model file with one or more model parameters or several model files, so it can detect the mime types with the model of those mime types. Details of the research and usage of this proposed feature will be posted on my github shortly. It is worth noting again that in this research we only worked out one model - GRB - as one example to demonstrate the use of this content-based mime detection. One of the challenges again is that the non-GRB file types cannot be clearly defined unless we feed our model with some example data for all of the existing file types in the world, but this seems too utopian and rather unlikely, so it is better that the set of classes/types is given and defined in advance to minimize the problem domain. Another challenge is the size of the training data;
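The byte-frequency-histogram input layer described above can be sketched in a few lines: count occurrences of each byte value (256 bins) and normalize into [0, 1]. Dividing by the largest bin is one plausible normalization, chosen here for illustration; the issue also mentions a companding function to boost infrequent bytes, which is omitted.

```java
public class ByteHistogram {
    // Compute a 256-bin byte-frequency histogram, normalized to [0, 1]
    // by dividing each count by the largest count (an assumed normalization).
    static double[] histogram(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xFF]++; // mask to get an unsigned bin index 0..255
        }
        int max = 1; // avoid division by zero on empty input
        for (int c : counts) max = Math.max(max, c);
        double[] normalized = new double[256];
        for (int i = 0; i < 256; i++) {
            normalized[i] = counts[i] / (double) max;
        }
        return normalized;
    }

    public static void main(String[] args) {
        byte[] data = "aab".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        double[] h = histogram(data);
        System.out.println(h['a']); // most frequent byte -> 1.0
        System.out.println(h['b']); // half as frequent -> 0.5
    }
}
```

The resulting 256-element vector is what would be fed to the neural network's input layer.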
[jira] [Commented] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385803#comment-14385803 ] Uwe Schindler commented on TIKA-1511: - Solr uses ANT + IVY to build. We don't use transitive dependencies at all! So whenever updating TIKA, the person who does this prints the dependency tree and then fills all required information into the ivy.xml file and our ivy-versions.properties file :-) In general, we carefully decide which dependencies are really needed. Because TIKA automatically disables parsers which do not load, we have already removed various files (like the netcdf parser - LGPL) or the ASM parser (we don't support indexing Java class files by default). For the current one: We don't want to have native libraries anywhere (we don't even ship our own native libs for WindowsDirectory). Users need to do this themselves and start msvc/gcc. So we would not ship with SQLite support by default. In general it would be good to have some easier plugin mechanism to allow Solr to pick only some parsers they ship by default and those the user can download (e.g. by a script). So it would be good to have multiple parser JARs. So maybe put all crazy parsers that fork processes or call native libs into a separate TIKA parser bundle. The default one should only have pure-Java stuff with as few dependencies as possible... Create a parser for SQLite3 --- Key: TIKA-1511 URL: https://issues.apache.org/jira/browse/TIKA-1511 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.6 Reporter: Luis Filipe Nassif Fix For: 1.8 Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, TIKA-1511v3.patch, TIKA-1511v3bis.patch, testSQLLite3b.db, testSQLLite3b.db I think it would be very useful, as sqlite is used as data storage by a wide range of applications. Opening the ticket to track it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333400#comment-14333400 ] Uwe Schindler commented on TIKA-1558: - Hi, Lucene uses SPI for its index codecs, so we are familiar with SPI. But we have no problems with the order of the classpath. We just preserve what Java delivers in Classloader.getResources(). But order is not really important (it was important for testing in Lucene 4.x, but that's history since last Friday). We already have a custom TikaConfig class so I am happy to use that. In our case we would only put the SPI exclusion into our test classpath. But TikaConfig is also fine. Create a Parser Blacklist - Key: TIKA-1558 URL: https://issues.apache.org/jira/browse/TIKA-1558 Project: Tika Issue Type: New Feature Reporter: Tyler Palsulich Assignee: Tyler Palsulich Fix For: 1.8 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333400#comment-14333400 ] Uwe Schindler edited comment on TIKA-1558 at 2/23/15 4:06 PM: -- Hi, Lucene uses SPI for its index codecs, so we are familiar with SPI. But we have no problems with the order of the classpath. We just preserve what Java delivers in Classloader.getResources(). But order is not really important (it was important for testing in Lucene 4.x, but that's history since last Friday). We already have custom TikaConfig support in the extraction module, so I am happy to use that. In our case we would only put the SPI exclusion into our test classpath. But TikaConfig is also fine. Create a Parser Blacklist - Key: TIKA-1558 URL: https://issues.apache.org/jira/browse/TIKA-1558 Project: Tika Issue Type: New Feature Reporter: Tyler Palsulich Assignee: Tyler Palsulich Fix For: 1.8 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. 
Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
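The blacklist semantics proposed above (and the subclass behavior requested in the TIKA-1557 thread) can be sketched with an assignability check: drop every loaded service whose class is an instance of an entry on the blacklist, so blacklisting a base class also disables all of its subclasses. The parser classes here are stand-ins, not Tika's real ones.

```java
import java.util.List;
import java.util.stream.Collectors;

public class BlacklistDemo {
    // Stand-in hierarchy for illustration only.
    interface Parser {}
    static class ExternalParser implements Parser {}
    static class OcrParser extends ExternalParser {}
    static class TextParser implements Parser {}

    // Keep only parsers that match no blacklist entry; Class.isInstance
    // makes a blacklisted base class also catch its subclasses.
    static List<Parser> filter(List<Parser> loaded, List<Class<?>> blacklist) {
        return loaded.stream()
                .filter(p -> blacklist.stream().noneMatch(b -> b.isInstance(p)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Parser> kept = filter(
                List.of(new OcrParser(), new TextParser()),
                List.of(ExternalParser.class)); // blacklist the base class
        System.out.println(kept.size());        // only TextParser survives
    }
}
```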
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333628#comment-14333628 ] Uwe Schindler commented on TIKA-1526: - Thanks David! ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers Key: TIKA-1526 URL: https://issues.apache.org/jira/browse/TIKA-1526 Project: Tika Issue Type: Wish Reporter: Hoss Man the JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them... https://bugs.openjdk.java.net/browse/JDK-8047340 https://bugs.openjdk.java.net/browse/JDK-8055301 As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled and configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so... {noformat} [junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform. 
[junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
[junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
[junit4] at java.security.AccessController.doPrivileged(Native Method)
[junit4] at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
[junit4] at java.lang.ProcessImpl.start(ProcessImpl.java:130)
[junit4] at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
[junit4] at java.lang.Runtime.exec(Runtime.java:620)
[junit4] at java.lang.Runtime.exec(Runtime.java:485)
[junit4] at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
[junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
[junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
[junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4] at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
[junit4] at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
[junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4] at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
[junit4] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[junit4] at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat} ...unless they go out of their way to white list only the parsers they need/want so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed. It would be nice if Tika's ExternalParser class added a similar hack/workaround to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language... 
{code}
} catch (Error err) {
  if (err.getMessage() != null &&
      (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
    log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
    return "(error executing: " + cmd + ")";
  }
}
{code} ...but with Tika, it might be better for all ExternalParsers to just opt out as if they don't recognize the filetype when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... i'm not really sure how it would best fit into Tika's architecture) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1557) Create TesseractOCR Option to Never Run
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329523#comment-14329523 ] Uwe Schindler commented on TIKA-1557: - I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should also work for subclasses, so one could disable all ForkParser subclasses by adding ForkParser to the blacklist. Create TesseractOCR Option to Never Run --- Key: TIKA-1557 URL: https://issues.apache.org/jira/browse/TIKA-1557 Project: Tika Issue Type: New Feature Components: parser Reporter: Tyler Palsulich Assignee: Tyler Palsulich Priority: Minor Fix For: 1.8 As brought up in TIKA-1555, TesseractOCRParser should have an option to never be run. So, we can add an {{enabled}} option to the Config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1557) Create TesseractOCR Option to Never Run
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329523#comment-14329523 ] Uwe Schindler edited comment on TIKA-1557 at 2/20/15 9:05 PM: -- I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should also work for subclasses, so one could disable all ExternalParser subclasses by adding ExternalParser to blacklist. was (Author: thetaphi): I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should also work for subclasses, so one could disable all ForkParser subclasses by adding ForkParser to blacklist. Create TesseractOCR Option to Never Run --- Key: TIKA-1557 URL: https://issues.apache.org/jira/browse/TIKA-1557 Project: Tika Issue Type: New Feature Components: parser Reporter: Tyler Palsulich Assignee: Tyler Palsulich Priority: Minor Fix For: 1.8 As brought up in TIKA-1555, TesseractOCRParser should have an option to never be run. So, we can add an {{enabled}} option to the Config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1557) Create TesseractOCR Option to Never Run
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329523#comment-14329523 ] Uwe Schindler edited comment on TIKA-1557 at 2/20/15 8:42 PM: -- I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should also work for subclasses, so one could disable all ForkParser subclasses by adding ForkParser to blacklist. was (Author: thetaphi): I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should work alos for subclasses, so one could disable all ForkParser subclasses by adding ForkParser to blacklist. Create TesseractOCR Option to Never Run --- Key: TIKA-1557 URL: https://issues.apache.org/jira/browse/TIKA-1557 Project: Tika Issue Type: New Feature Components: parser Reporter: Tyler Palsulich Assignee: Tyler Palsulich Priority: Minor Fix For: 1.8 As brought up in TIKA-1555, TesseractOCRParser should have an option to never be run. So, we can add an {{enabled}} option to the Config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329276#comment-14329276 ] Uwe Schindler commented on TIKA-1555: - Also, this issue in the JDK is already fixed in Java 7u80 and 8u40 (to be released in the next 2 months): https://bugs.openjdk.java.net/browse/JDK-8047340 posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Environment: MacOS X 10.10.2 Reporter: David Pilato Labels: ocr, parser It can happen on some systems that posix_spawn is not a supported process launch mechanism. We are doing random testing which simulates different kinds of Locale, so I sometimes hit this issue: {code}
java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
	at java.lang.UNIXProcess$1.run(UNIXProcess.java:104)
	at java.lang.UNIXProcess$1.run(UNIXProcess.java:93)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:91)
	at java.lang.ProcessImpl.start(ProcessImpl.java:130)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
	at java.lang.Runtime.exec(Runtime.java:617)
	at java.lang.Runtime.exec(Runtime.java:485)
	at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
	at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
	at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
	at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
	at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
	at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
	at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
	at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.Tika.parseToString(Tika.java:506)
{code} It sounds like it's related to this: http://java.thedizzyheights.com/2014/07/java-error-posix_spawn-is-not-a-supported-process-launch-mechanism-on-this-platform-when-trying-to-spawn-a-process/ Though I have a hard time reproducing it! BTW I wonder if we could add a setting which can return {{false}} for {{TesseractOCRParser#hasTesseract}} even if we have tesseract available. For example, let's say my machine is shared by multiple applications and for one of them I don't want any OCR on my documents. Hope this helps. Let me know if you need more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329282#comment-14329282 ] Uwe Schindler commented on TIKA-1555: - @UweSays: https://twitter.com/UweSays/status/501425093613207552 posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329272#comment-14329272 ] Uwe Schindler commented on TIKA-1555: - This is a duplicate of TIKA-1526. posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329350#comment-14329350 ] Uwe Schindler commented on TIKA-1526: - I was not able to test this, because I have no MacOSX computer and FreeBSD is only a Jenkins server. Maybe [~dadoonet] can try the same with the elasticsearch-mapper-attachments module. ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers Key: TIKA-1526 URL: https://issues.apache.org/jira/browse/TIKA-1526 Project: Tika Issue Type: Wish Reporter: Hoss Man the JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them... https://bugs.openjdk.java.net/browse/JDK-8047340 https://bugs.openjdk.java.net/browse/JDK-8055301 As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled & configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so... {noformat}
[junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
[junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
[junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
[junit4]   at java.security.AccessController.doPrivileged(Native Method)
[junit4]   at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
[junit4]   at java.lang.ProcessImpl.start(ProcessImpl.java:130)
[junit4]   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
[junit4]   at java.lang.Runtime.exec(Runtime.java:620)
[junit4]   at java.lang.Runtime.exec(Runtime.java:485)
[junit4]   at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
[junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
[junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
[junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]   at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
[junit4]   at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
[junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]   at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
[junit4]   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[junit4]   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat} ...unless they go out of their way to whitelist only the parsers they need/want so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed. It would be nice if Tika's ExternalParser class added a similar hack/workaround to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language... {code}
} catch (Error err) {
  if (err.getMessage() != null &&
      (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
    log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
    return "(error executing: " + cmd + ")";
  }
}
{code} ...but with Tika, it might be better for all ExternalParsers to just opt out as if they don't recognize the filetype when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... I'm not really sure how it would best fit into Tika's architecture) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
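The opt-out Hoss sketches for ExternalParsers can be illustrated outside Solr/Tika. Below is a minimal, self-contained sketch (the `forkExternalTool`/`externalToolAvailable` names are hypothetical, not real Tika API): the forking step is simulated by throwing the exact `Error` the JDK bug produces, and the caller converts it into "tool not available" instead of crashing.

```java
public class PosixSpawnTrap {
    // Stand-in for something like ExternalParser.check(): here it always fails
    // the way the JDK locale bug does, so the trap below has something to catch.
    static void forkExternalTool() {
        throw new Error("posix_spawn is not a supported process launch mechanism on this platform.");
    }

    // SOLR-6387-style workaround: treat the locale bug as "external tool not
    // available" rather than letting the Error propagate to the caller.
    static boolean externalToolAvailable() {
        try {
            forkExternalTool();
            return true;
        } catch (Error err) {
            String msg = err.getMessage();
            if (msg != null && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"))) {
                return false; // quietly opt out, as suggested for ExternalParsers
            }
            throw err; // anything else is a real error and must not be swallowed
        }
    }

    public static void main(String[] args) {
        System.out.println(externalToolAvailable()); // false
    }
}
```

With this shape, `getSupportedTypes()` could return an empty set when the check trips the bug, so AutoDetectParser would simply skip the parser.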
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329344#comment-14329344 ] Uwe Schindler commented on TIKA-1555: - Hi David, can you try to compile Tika from a current trunk checkout and test it with ES? If this fixes the issue with the Turkish locale, could you report on TIKA-1526? For me it's hard to reproduce with Windows or Linux. I just analyzed the issue, reported the bug to Oracle and fixed Solr 5.0, but I did no thorough testing on the Tika issue. posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329364#comment-14329364 ] Uwe Schindler commented on TIKA-1555: - bq. BTW I wonder if we could add a setting which can return false for TesseractOCRParser#hasTesseract even if we have tesseract available. You can remove / add custom parsers through the TikaConfig. But I agree, it's hard to maintain, because you have to provide a static list. I would really like to have a separate TikaConfig option to explicitly disable some parsers, so I can use the default SPI lookup but blacklist parsers. We would like to do the same in Solr, too. posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329474#comment-14329474 ] Uwe Schindler commented on TIKA-1555: - bq. You can also disable OCR by setting the Tesseract path to "" in the TesseractOCRConfig. This did not work. If this would disable the fork I would be happy. But it just disables the parser as a side effect, because it tries to fork an invalid process path which is created from the empty string and some suffix. posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289125#comment-14289125 ] Uwe Schindler commented on TIKA-1526: - [~grossws]: This bug is not in Maven itself; the problem here is an unsolved bug in the JDK itself. Maven is perfectly fine, but because of the JDK bug, Maven cannot spawn external processes. ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers Key: TIKA-1526 URL: https://issues.apache.org/jira/browse/TIKA-1526 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288963#comment-14288963 ] Uwe Schindler commented on TIKA-1526: - I tried it with Maven, but this is all too funny. This bug also affects Maven... {noformat}
[uschindler@lucene ~]$ export MAVEN_OPTS=-Duser.language=tr
[uschindler@lucene ~]$ mvn
---
constituent[0]: file:/usr/local/share/java/maven3/lib/aether-connector-wagon-1.13.1.jar
constituent[1]: file:/usr/local/share/java/maven3/lib/maven-repository-metadata-3.0.4.jar
constituent[2]: file:/usr/local/share/java/maven3/lib/plexus-sec-dispatcher-1.3.jar
constituent[3]: file:/usr/local/share/java/maven3/lib/aether-spi-1.13.1.jar
constituent[4]: file:/usr/local/share/java/maven3/lib/maven-compat-3.0.4.jar
constituent[5]: file:/usr/local/share/java/maven3/lib/plexus-component-annotations-1.5.5.jar
constituent[6]: file:/usr/local/share/java/maven3/lib/plexus-cipher-1.7.jar
constituent[7]: file:/usr/local/share/java/maven3/lib/sisu-guava-0.9.9.jar
constituent[8]: file:/usr/local/share/java/maven3/lib/maven-core-3.0.4.jar
constituent[9]: file:/usr/local/share/java/maven3/lib/plexus-utils-2.0.6.jar
constituent[10]: file:/usr/local/share/java/maven3/lib/wagon-provider-api-2.2.jar
constituent[11]: file:/usr/local/share/java/maven3/lib/maven-plugin-api-3.0.4.jar
constituent[12]: file:/usr/local/share/java/maven3/lib/maven-model-builder-3.0.4.jar
constituent[13]: file:/usr/local/share/java/maven3/lib/maven-settings-3.0.4.jar
constituent[14]: file:/usr/local/share/java/maven3/lib/sisu-inject-bean-2.3.0.jar
constituent[15]: file:/usr/local/share/java/maven3/lib/wagon-http-2.2-shaded.jar
constituent[16]: file:/usr/local/share/java/maven3/lib/maven-aether-provider-3.0.4.jar
constituent[17]: file:/usr/local/share/java/maven3/lib/sisu-inject-plexus-2.3.0.jar
constituent[18]: file:/usr/local/share/java/maven3/lib/maven-artifact-3.0.4.jar
constituent[19]: file:/usr/local/share/java/maven3/lib/maven-model-3.0.4.jar
constituent[20]: file:/usr/local/share/java/maven3/lib/wagon-file-2.2.jar
constituent[21]: file:/usr/local/share/java/maven3/lib/maven-embedder-3.0.4.jar
constituent[22]: file:/usr/local/share/java/maven3/lib/sisu-guice-3.1.0-no_aop.jar
constituent[23]: file:/usr/local/share/java/maven3/lib/maven-settings-builder-3.0.4.jar
constituent[24]: file:/usr/local/share/java/maven3/lib/plexus-interpolation-1.14.jar
constituent[25]: file:/usr/local/share/java/maven3/lib/aether-impl-1.13.1.jar
constituent[26]: file:/usr/local/share/java/maven3/lib/aether-api-1.13.1.jar
constituent[27]: file:/usr/local/share/java/maven3/lib/aether-util-1.13.1.jar
constituent[28]: file:/usr/local/share/java/maven3/lib/commons-cli-1.2.jar
---
Exception in thread "main" java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
	at java.lang.UNIXProcess$1.run(UNIXProcess.java:111)
	at java.lang.UNIXProcess$1.run(UNIXProcess.java:93)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:91)
	at java.lang.ProcessImpl.start(ProcessImpl.java:130)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
	at java.lang.Runtime.exec(Runtime.java:617)
	at java.lang.Runtime.exec(Runtime.java:450)
	at java.lang.Runtime.exec(Runtime.java:347)
	at org.codehaus.plexus.interpolation.os.OperatingSystemUtils.getSystemEnvVars(OperatingSystemUtils.java:86)
	at org.codehaus.plexus.interpolation.EnvarBasedValueSource.getEnvars(EnvarBasedValueSource.java:74)
	at org.codehaus.plexus.interpolation.EnvarBasedValueSource.<init>(EnvarBasedValueSource.java:64)
	at org.codehaus.plexus.interpolation.EnvarBasedValueSource.<init>(EnvarBasedValueSource.java:50)
	at org.apache.maven.settings.building.DefaultSettingsBuilder.interpolate(DefaultSettingsBuilder.java:222)
	at org.apache.maven.settings.building.DefaultSettingsBuilder.build(DefaultSettingsBuilder.java:101)
	at org.apache.maven.cli.MavenCli.settings(MavenCli.java:725)
	at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:193)
	at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
	at
[jira] [Commented] (TIKA-1529) Turn forbidden-apis back on
[ https://issues.apache.org/jira/browse/TIKA-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289438#comment-14289438 ] Uwe Schindler commented on TIKA-1529: - If you just check for ASCII chars in some string of unknown encoding, the easiest is to use US-ASCII as the charset; this will always work, also with UTF-8 :-) Turn forbidden-apis back on --- Key: TIKA-1529 URL: https://issues.apache.org/jira/browse/TIKA-1529 Project: Tika Issue Type: Bug Reporter: Tim Allison Priority: Minor [~thetaphi] recently noticed that forbidden-apis was turned off in r1624185, and he submitted a patch to the dev list. Let's turn it back on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
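Uwe's US-ASCII point can be shown with a few lines of plain JDK code (class and method names here are illustrative only): a `CharsetEncoder` for US-ASCII accepts exactly the 7-bit range, so it gives a locale-independent ASCII check that never misfires on UTF-8 input.

```java
import java.nio.charset.StandardCharsets;

public class AsciiCheck {
    // True iff every character of s is plain 7-bit ASCII; US-ASCII is a
    // strict subset of UTF-8, so this check is safe for any encoding guess.
    static boolean isAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(isAscii("hello"));         // true
        System.out.println(isAscii("h\u00e9llo"));    // false: é is not ASCII
    }
}
```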
[jira] [Comment Edited] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289182#comment-14289182 ] Uwe Schindler edited comment on TIKA-1526 at 1/23/15 12:32 PM: --- To work around this bug you can in fact do this. It is just bad to change the user's default locale, which may especially break multi-threaded applications. One solution could be: During startup of the JVM (in the Plexus launcher's main method) you can do the following: - check for the locale; we do this like that: {{new Locale("tr").getLanguage().equals(Locale.getDefault().getLanguage())}} (it is important to do the check like this, because otherwise it's not guaranteed that it really works, especially in newer Java versions!!!) - if it's such a locale, switch to Locale.ROOT (save the original) in a single-threaded environment (this is why it should be in the main launcher) - execute a fake UNIX command, like /bin/true. You can also execute some non-existing command that just fails. The call is just there to statically initialize the broken UNIXProcess class. Once it is initialized correctly it works - switch back to the saved locale was (Author: thetaphi): To work around this bug you can in fact do this. It is just bad to change the user's default locale, which may especially break multi-threaded applications. One solution could be: During startup of the JVM (in the Plexus launcher's main method) you can do the following: - check for the locale; we do this like that: {{new Locale("tr").getLanguage().equals(Locale.getDefault().getLanguage())}} (it is important to do the check like this, because otherwise it's not guaranteed that it really works, especially in newer Java versions!!!) - if it's such a locale, switch to Locale.ROOT (save the original) in a single-threaded environment (this is why it should be in the main launcher) - execute a fake UNIX command, like /bin/true. You can also execute nothing; it is just there to statically initialize the broken UNIXProcess class.
Once it is initialized correctly it works - switch back to saved locale ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers Key: TIKA-1526 URL: https://issues.apache.org/jira/browse/TIKA-1526 Project: Tika Issue Type: Wish Reporter: Hoss Man the JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them... https://bugs.openjdk.java.net/browse/JDK-8047340 https://bugs.openjdk.java.net/browse/JDK-8055301 As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so... {noformat} [junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform. [junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:105) [junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:94) [junit4] at java.security.AccessController.doPrivileged(Native Method) [junit4] at java.lang.UNIXProcess.clinit(UNIXProcess.java:92) [junit4] at java.lang.ProcessImpl.start(ProcessImpl.java:130) [junit4] at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) [junit4] at java.lang.Runtime.exec(Runtime.java:620) [junit4] at java.lang.Runtime.exec(Runtime.java:485) [junit4] at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344) [junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117) [junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90) [junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81) [junit4] at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95) [junit4] at 
org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229) [junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81) [junit4] at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209) [junit4] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) [junit4] at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) {noformat} ...unless they go out of their way to white list only the parsers
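The startup steps above could be sketched roughly as follows. This is a minimal sketch, not Tika or Plexus code: the class and the {{isTurkishLocale}} helper are illustrative names, and /bin/true stands in for any command whose only purpose is to trigger static initialization of UNIXProcess.

```java
import java.util.Locale;

public class TurkishLocaleWorkaround {

    // Helper for the locale check described above. Comparing via
    // getLanguage() on a freshly constructed Locale is deliberate:
    // newer Java versions may canonicalize language tags, so comparing
    // raw locale strings is not guaranteed to work.
    static boolean isTurkishLocale(Locale def) {
        return new Locale("tr").getLanguage().equals(def.getLanguage());
    }

    public static void main(String[] args) {
        Locale saved = Locale.getDefault();
        if (isTurkishLocale(saved)) {
            Locale.setDefault(Locale.ROOT);  // switch while still single-threaded
            try {
                // Any exec triggers static initialization of UNIXProcess;
                // the command itself may fail or not even exist.
                Runtime.getRuntime().exec("/bin/true").waitFor();
            } catch (Exception ignored) {
                // Only the class initialization matters, not the result.
            } finally {
                Locale.setDefault(saved);  // restore the user's locale
            }
        }
    }
}
```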
[jira] [Comment Edited] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14288444#comment-14288444 ] Uwe Schindler edited comment on TIKA-1526 at 1/22/15 11:29 PM: --- Hi Tylor: The problem is explained above. To reproduce it you have to be careful: the original error happens *exactly once*. All later attempts to use the same JVM will cause a NoClassDefFoundError on the UNIXProcess class. Unfortunately I am very tired at the moment; it is past midnight. The main problem is that all other ExternalParser tests will/may fail afterwards in the same JVM if the Turkish locale is used. The commit will fix the issue we see in Solr, but the original issue may still survive if you really try to use ExternalParser in other tests. For which other parsers is it currently used? Only Tesseract, or others too? In Solr we have the problem because the TesseractParser fails during initialization (determining which MIME types it is responsible for), and that is the fatal part. I have no idea about other parsers; if they just fail while parsing, I don't care. The big problem is that the Tesseract parser fails in the Turkish locale and blocks other parsers from executing, because the call to getSupportedTypes() fails [and that is the horrible thing in this bug]. So basically, to reproduce: choose exactly one test you know fails and try it with and without the patch. Don't run other tests that may spawn processes in the same JVM.
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287820#comment-14287820 ] Uwe Schindler commented on TIKA-1526: - FYI: The underlying bug in the JVM will never be fixed in Java 6. Java 9 previews are no longer affected, but Java 7 and Java 8 are still broken (including the update from yesterday). Oracle will possibly fix it in 7u80 (the last Java 7 release before EOL) and 8u40.

ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers

Key: TIKA-1526
URL: https://issues.apache.org/jira/browse/TIKA-1526
Project: Tika
Issue Type: Wish
Reporter: Hoss Man

The JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them...
https://bugs.openjdk.java.net/browse/JDK-8047340
https://bugs.openjdk.java.net/browse/JDK-8055301
As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled and configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so...
{noformat}
[junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
[junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
[junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
[junit4]   at java.security.AccessController.doPrivileged(Native Method)
[junit4]   at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
[junit4]   at java.lang.ProcessImpl.start(ProcessImpl.java:130)
[junit4]   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
[junit4]   at java.lang.Runtime.exec(Runtime.java:620)
[junit4]   at java.lang.Runtime.exec(Runtime.java:485)
[junit4]   at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
[junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
[junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
[junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]   at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
[junit4]   at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
[junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]   at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
[junit4]   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[junit4]   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}
...unless they go out of their way to white-list only the parsers they need/want so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed. It would be nice if Tika's ExternalParser class added a similar hack/workaround to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language...
{code}
} catch (Error err) {
  if (err.getMessage() != null
      && (err.getMessage().contains("posix_spawn")
          || err.getMessage().contains("UNIXProcess"))) {
    log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
    return "(error executing: " + cmd + ")";
  }
}
{code}
...but with Tika, it might be better for all ExternalParsers to just opt out as if they don't recognize the filetype when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... i'm not really sure how it would best fit into Tika's architecture) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
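A guard of the kind the description proposes could look roughly like the sketch below. This is illustrative only: {{safeCheck}} and {{looksLikeLocaleBug}} are hypothetical names, not Tika's actual ExternalParser API, and the message heuristic is the one from the SOLR-6387 snippet above.

```java
public class ExternalCheckGuard {

    // Heuristic from the SOLR-6387 snippet: the locale-triggered Error
    // mentions posix_spawn or UNIXProcess in its message.
    static boolean looksLikeLocaleBug(Error err) {
        String msg = err.getMessage();
        return msg != null
                && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    // Hypothetical availability check: spawn the command, and treat the
    // JVM locale bug as "command not available" instead of failing fast,
    // so the parser simply opts out of getSupportedTypes().
    static boolean safeCheck(String... cmd) {
        try {
            Process p = new ProcessBuilder(cmd).start();
            p.waitFor();
            return true;
        } catch (Error err) {
            if (looksLikeLocaleBug(err)) {
                return false;  // opt out quietly
            }
            throw err;         // unrelated Errors still propagate
        } catch (Exception e) {
            return false;      // command missing or not executable
        }
    }
}
```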
[jira] [Comment Edited] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287824#comment-14287824 ] Uwe Schindler edited comment on TIKA-1526 at 1/22/15 5:36 PM: -- Tim: Linux does not use posix_spawn; you need MacOSX or Solaris. Oracle has a completely different implementation for spawning processes on Linux.
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287850#comment-14287850 ] Uwe Schindler commented on TIKA-1526: - There is also a second problem: the bug is in the static {} initializer of the UNIXProcess class, so it only strikes when the class is loaded for the first time. If it was loaded correctly, the class is initialized with the right settings and passes (in fact, even with the Turkish locale). But if it fails the first time, UNIXProcess is broken for the whole lifetime of the JVM (even with good locales): because UNIXProcess failed to initialize, the JVM marks it as broken and you get a NoClassDefFoundError. The problem does not happen on Linux, because there the default value of the problematic system property is initialized with some other value that does not contain an "i", the letter affected by the famous upper/lowercasing bug: http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html So if some other test executes an external process first under a non-Turkish locale, later calls with the Turkish locale also succeed. Because of that we test Lucene with all possible locales set before the JVM starts; we don't switch the locale actively during tests.
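The JVM behavior just described (fail once, then NoClassDefFoundError for the rest of the JVM's lifetime) can be demonstrated in isolation. The {{Fragile}} class below is purely illustrative and stands in for UNIXProcess; its initializer always throws, simulating the locale-dependent failure.

```java
public class StaticInitFailureDemo {

    // Stand-in for UNIXProcess: its static initializer always throws.
    static class Fragile {
        static {
            if (true) {  // constant guard: an unconditional throw would not compile
                throw new RuntimeException("simulated posix_spawn init failure");
            }
        }
        static String ping() { return "ok"; }
    }

    // Returns the name of the error raised by touching the class.
    static String use() {
        try {
            return Fragile.ping();
        } catch (ExceptionInInitializerError e) {
            return "ExceptionInInitializerError";  // first use: initializer throws
        } catch (NoClassDefFoundError e) {
            return "NoClassDefFoundError";         // later uses: class marked erroneous
        }
    }

    public static void main(String[] args) {
        System.out.println("first use:  " + use());
        System.out.println("second use: " + use());
    }
}
```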
[jira] [Commented] (TIKA-1435) Update rome dependency to 1.5
[ https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283723#comment-14283723 ] Uwe Schindler commented on TIKA-1435: - Indeed this confused me while doing the Apache Solr update (SOLR-6991). Apache Lucene/Solr does not allow transitive dependencies, so everything is declared explicitly using Ivy. This caused some headache while doing mvn dependency:list and checking all of them manually. Update rome dependency to 1.5 - Key: TIKA-1435 URL: https://issues.apache.org/jira/browse/TIKA-1435 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Johannes Mockenhaupt Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.8 Attachments: netcdf-deps-changes.diff Rome 1.5 has been released to Sonatype (https://github.com/rometools/rome/issues/183). Though the website (http://rometools.github.io/rome/) is blissfully ignorant of that. The update is mostly maintenance, adopting slf4j and generics as well as moving the namespace from _com.sun.syndication_ to _com.rometools_. PR upcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283148#comment-14283148 ] Uwe Schindler commented on TIKA-1523: - Hi, I did some research: This is a bug in Word 2000 (aka Word 9.0) fixed in later versions. Indeed the page count is initially wrong on saving if you don't scroll to the end. People were complaining about that at the time, too, because it sometimes caused the total page number in footnotes to be incorrect as well. http://support.microsoft.com/kb/212653/en-us See also: http://www.ms-office-forum.net/forum/archive/index.php?t-125861.html (German only, 1st comment, translated): {quote} SSD 26.04.2004, 21:07 I take the page count from the properties of a Word file into Access. Now I have the problem that while the file is open in Word, the properties show the correct page count. But when the file is closed and I check the properties in the Open dialog, the page count (it always shows 1 page at first) is only correct after saving the file several times. What could be the cause, and how can I change this? {quote} And: https://groups.google.com/forum/#!topic/microsoft.public.word.vba.general/daf-sUpPlgs You see, initially the page count is wrong. If you open a file with Word 2000 / 9.0 and save it without waiting until the full count was calculated (computers were slower at that time), it saved 1. :-) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283116#comment-14283116 ] Uwe Schindler edited comment on TIKA-1523 at 1/19/15 10:50 PM: --- Yes. It extracts just the metadata via the COM interface for the Windows quick-view component (you don't even need Word installed for that). So I think this is an issue with this old version of Word. In fact when you open the file in Word, it of course shows the real pages and it also recalculates the count, but initially it also shows 1. But here, the metadata as saved in the file is simply 1 or maybe nothing (see below). POI does not reflow the layout to calculate that information. This is why the metadata is only updated by the word processing program on opening and editing the file. If you instruct Word 2010 to open the file read-only (which it does because it's downloaded from the Internet), it shows in the page column. See 2nd screenshot. So clearly a bug in this file, not TIKA's or POI's issue. was (Author: thetaphi): Yes. It extracts just the metadata. So I think this is an issue with this old version of Word. In fact when you open the file in Word, it of course shows the real pages and it also recalculates the count, but initially it also shows 1. But here, the metadata as saved in the file is simply 1 or maybe nothing (see below). POI does not reflow the layout to calculate that information. This is why the metadata is only updated by the word processing program on opening and editing the file. If you instruct Word 2010 to open the file read-only (which it does because it's downloaded from the Internet), it shows in the page column. See 2nd screenshot. So clearly a bug in this file, not TIKA's or POI's issue. 
metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1523: Attachment: screenshot-2.png metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1523: Attachment: (was: screenshot-2.png) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1523: Attachment: screenshot-2.png metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283116#comment-14283116 ] Uwe Schindler commented on TIKA-1523: - Yes. It extracts just the metadata. So I think this is an issue with this old version of Word. In fact when you open the file in Word, it of course shows the real pages and it also recalculates the count, but initially it also shows 1. But here, the metadata as saved in the file is simply 1 or maybe nothing (see below). POI does not reflow the layout to calculate that information. This is why the metadata is only updated by the word processing program on opening and editing the file. If you instruct Word 2010 to open the file read-only (which it does because it's downloaded from the Internet), it shows in the page column. See 2nd screenshot. So clearly a bug in this file, not TIKA's or POI's issue. metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283148#comment-14283148 ] Uwe Schindler edited comment on TIKA-1523 at 1/19/15 11:16 PM: --- Hi, I did some research: This is a bug in Word 2000 (aka Word 9.0) fixed in later versions. Indeed the page count is initially wrong on saving if you don't scroll to the end. People were complaining about that at the time, too, because it sometimes caused the total page number in footnotes to be incorrect as well. http://support.microsoft.com/kb/212653/en-us See also: http://www.ms-office-forum.net/forum/archive/index.php?t-125861.html (German only, 1st comment, translated): {quote} SSD 26.04.2004, 21:07 I take the page count from the properties of a Word file into Access. Now I have the problem that while the file is open in Word, the properties show the correct page count. But when the file is closed and I check the properties in the Open dialog, the page count (it always shows 1 page at first) is only correct after saving the file several times. What could be the cause, and how can I change this? {quote} And: https://groups.google.com/forum/#!topic/microsoft.public.word.vba.general/daf-sUpPlgs {quote} Anyone can help me with this? If I take out Sleep 1, myDoc.BuiltinDocumentProperties(wdPropertyPages) doesnt return the correct number of pages sometimes. For example, if a document has 200 pages, it may come out to return 140, or sometimes 199, instead of 200. To me, it seems it takes some time for MS word to think and get the number of pages. After i put Sleep 1, 99% I got the correct number of pages. However, this will take very long time to process as I need to read 200 to 300 files and the number of pages from each files. Please let me know if there is another better solution for this. {quote} You see, initially the page count is wrong. 
If you open a file with Word 2000 / 9.0 and save it without waiting until the full count was calculated (computers were slower at that time), it saved 1. :-) was (Author: thetaphi): Hi, I did some research: This is a bug in Word 2000 (aka Word 9.0) fixed in later versions. Indeed the page count is initially wrong on saving if you don't scroll to the end. People were complaining about that at the time, too, because it sometimes caused the total page number in footnotes to be incorrect as well. http://support.microsoft.com/kb/212653/en-us See also: http://www.ms-office-forum.net/forum/archive/index.php?t-125861.html (German only, 1st comment, translated): {quote} SSD 26.04.2004, 21:07 I take the page count from the properties of a Word file into Access. Now I have the problem that while the file is open in Word, the properties show the correct page count. But when the file is closed and I check the properties in the Open dialog, the page count (it always shows 1 page at first) is only correct after saving the file several times. What could be the cause, and how can I change this? {quote} And: https://groups.google.com/forum/#!topic/microsoft.public.word.vba.general/daf-sUpPlgs You see, initially the page count is wrong. If you open a file with Word 2000 / 9.0 and save it without waiting until the full count was calculated (computers were slower at that time), it saved 1. :-) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1523: Attachment: screenshot-1.png metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283092#comment-14283092 ] Uwe Schindler commented on TIKA-1523: - If I save the file with Office 2010, the page number is updated and shows correctly in right-click/Properties. TIKA also shows it. metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1457) NullPointerException in tika-app, parsing PDF content
[ https://issues.apache.org/jira/browse/TIKA-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186533#comment-14186533 ] Uwe Schindler commented on TIKA-1457: - Hi, the next version of Solr with TIKA 1.6 will be Solr 5.0, there will be no more 4.x releases (except bugfix/security). If TIKA 1.7 comes out in the meantime, we will update. About replacing TIKA in a given Solr installation: Yes this may work in most cases. For the change TIKA 1.5 - TIKA 1.6 in current Lucene/Solr 5.x branch, I only changed the dependencies - code changes in the main source code were not needed (the API of TIKA itself is quite stable). I only had to fix one test because of an additional new header X-Parsed-By, which made the test fail. NullPointerException in tika-app, parsing PDF content - Key: TIKA-1457 URL: https://issues.apache.org/jira/browse/TIKA-1457 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Environment: OS - Linux Centos 6.5 Web APP - Tomcat6 Using Solr 4.10 Tika Jar * tika-core-1.5.jar * tika-parsers-1.5.jar * tika-xmp-1.5.jar * pdfbox-1.8.4.jar Reporter: Tadeu Alves Labels: bug, parser, solr, tika,text-extraction Fix For: 1.6 When I try to extract text from some pdf files with the tika app 1.5 null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@52cfcf01 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777) at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@52cfcf01 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219) ... 
19 more Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 0 at java.lang.String.charAt(String.java:658) at org.apache.pdfbox.util.DateConverter.parseDate(DateConverter.java:680) at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:808) at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:780) at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:754) at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:797) at org.apache.pdfbox.pdmodel.PDDocumentInformation.getModificationDate(PDDocumentInformation.java:232) at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:176) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:142) at
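The root cause in the trace above is DateConverter calling charAt(0) on an empty date string. A defensive wrapper (a hypothetical helper for illustration, not PDFBox's actual code) would reject null or empty input before indexing into it:

```java
import java.util.Calendar;
import java.util.TimeZone;

public class SafeDate {

    /** Hypothetical defensive parse: the crash above is charAt(0) on an
     *  empty string, so bail out on null/empty input instead of throwing
     *  StringIndexOutOfBoundsException. */
    public static Calendar parsePdfDate(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        // PDF dates look like "D:YYYYMMDDHHmmSS..."; this sketch extracts
        // only the year field (a full implementation would validate digits
        // and parse the remaining fields and time zone offset).
        String s = raw.startsWith("D:") ? raw.substring(2) : raw;
        if (s.length() < 4) {
            return null;
        }
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(Calendar.YEAR, Integer.parseInt(s.substring(0, 4)));
        return cal;
    }
}
```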
[jira] [Comment Edited] (TIKA-1457) NullPointerException in tika-app, parsing PDF content
[ https://issues.apache.org/jira/browse/TIKA-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186533#comment-14186533 ] Uwe Schindler edited comment on TIKA-1457 at 10/28/14 7:50 AM: --- Hi, the next version of Solr with TIKA 1.6 will be Solr 5.0, there will be no more 4.x releases (except bugfix/security). If TIKA 1.7 comes out in the meantime, we will update. About replacing TIKA in a given Solr installation: Yes this may work in most cases. For the change TIKA 1.5 - TIKA 1.6 in current Lucene/Solr 5.x branch, I only changed the dependencies - code changes in the main source code were not needed (the API of TIKA itself is quite stable). I only had to fix one test because of an additional new header X-Parsed-By, which made the test fail. Be sure to exchange *all* JAR files (not only TIKA, also its deps) in contrib/extraction/lib!!! was (Author: thetaphi): Hi, the next version of Solr with TIKA 1.6 will be Solr 5.0, there will be no more 4.x releases (except bugfix/security). If TIKA 1.7 comes out in the meantime, we will update. About replacing TIKA in a given Solr installation: Yes this may work in most cases. For the change TIKA 1.5 - TIKA 1.6 in current Lucene/Solr 5.x branch, I only changed the dependencies - code changes in the main source code were not needed (the API of TIKA itself is quite stable). I only had to fix one test because of an additional new header X-Parsed-By, which made the test fail. 
NullPointerException in tika-app, parsing PDF content - Key: TIKA-1457 URL: https://issues.apache.org/jira/browse/TIKA-1457 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Environment: OS - Linux Centos 6.5 Web APP - Tomcat6 Using Solr 4.10 Tika Jar * tika-core-1.5.jar * tika-parsers-1.5.jar * tika-xmp-1.5.jar * pdfbox-1.8.4.jar Reporter: Tadeu Alves Labels: bug, parser, solr, tika,text-extraction Fix For: 1.6 When I try to extract text from some pdf files with the tika app 1.5 null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@52cfcf01 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) 
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@52cfcf01 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219) ... 19 more Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 0 at java.lang.String.charAt(String.java:658)
[jira] [Commented] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14184073#comment-14184073 ] Uwe Schindler commented on TIKA-1387: - I think this is already committed and working. I think the issue was just not closed. In parent/pom.xml the plugin is enabled... So we can keep 1.7 as the fix version and resolve this issue. Or am I missing something? Add forbidden-apis checker to TIKA build Key: TIKA-1387 URL: https://issues.apache.org/jira/browse/TIKA-1387 Project: Tika Issue Type: Improvement Components: general Reporter: Uwe Schindler Assignee: Tyler Palsulich Fix For: 1.8 Attachments: TIKA-1387.palsulich.080614.patch, TIKA-1387.patch, TIKA-1387.patch, TIKA-1387.patch Lucene and many other projects already use the forbidden-apis checker to prevent use of some broken classes/signatures from the JDK. These are especially things using default character sets or default locales. The forbidden-apis checker can also be used to explicitly disallow specific methods, if they have security issues (e.g., creating XML parsers without disabling external entity support). The attached patch adds the forbidden-apis checker to the tika-parent pom file with default configuration. Running it fails with many errors in TIKA core already: {noformat} [INFO] --- forbiddenapis:1.6.1:check (default) @ tika-core --- [INFO] Scanning for classes to check... [INFO] Reading bundled API signatures: jdk-unsafe [INFO] Reading bundled API signatures: jdk-deprecated [INFO] Loading classes to check... [INFO] Scanning for API signatures and dependencies... 
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.language.LanguageProfilerBuilder (LanguageProfilerBuilder.java:407) [ERROR] Forbidden method invocation: java.lang.String#toUpperCase() [Uses default locale] [ERROR] in org.apache.tika.io.FilenameUtils (FilenameUtils.java:68) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:257) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:395) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:416) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:438) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:532) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:550) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:588) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:656) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:782) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:851) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:957) [ERROR] 
Forbidden method invocation: java.io.OutputStreamWriter#init(java.io.OutputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:1064) [ERROR] Forbidden method invocation: java.io.OutputStreamWriter#init(java.io.OutputStream) [Uses default charset] [ERROR] in org.apache.tika.sax.WriteOutContentHandler (WriteOutContentHandler.java:93) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.parser.external.ExternalParser (ExternalParser.java:234) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.parser.external.ExternalParser$3 (ExternalParser.java:294) [ERROR] Forbidden method invocation: java.util.Calendar#getInstance(java.util.Locale) [Uses default locale or time zone] [ERROR] in org.apache.tika.utils.DateUtils (DateUtils.java:83) [ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[])
[jira] [Commented] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095798#comment-14095798 ] Uwe Schindler commented on TIKA-1387: - I think, for messages written in the English language (like those written to logs), ENGLISH is more correct. But it does not really matter. About the charsets: I would define a constant in IOUtils {{public static final Charset UTF_8 = Charset.forName("UTF-8");}} and then pass this to all methods that accept it (like Readers, String,...). This is also faster than a synchronized String lookup on every conversion, like done by the standard default charset or String charset parameter. Java 7 has StandardCharsets.UTF_8 but we cannot use this at the moment. But it's defined like the one I propose for IOUtils. Add forbidden-apis checker to TIKA build Key: TIKA-1387 URL: https://issues.apache.org/jira/browse/TIKA-1387 Project: Tika Issue Type: Improvement Components: general Reporter: Uwe Schindler Assignee: Tyler Palsulich Fix For: 1.7 Attachments: TIKA-1387.palsulich.080614.patch, TIKA-1387.patch, TIKA-1387.patch, TIKA-1387.patch Lucene and many other projects already use the forbidden-apis checker to prevent use of some broken classes/signatures from the JDK. These are especially things using default character sets or default locales. The forbidden-apis checker can also be used to explicitly disallow specific methods, if they have security issues (e.g., creating XML parsers without disabling external entity support). The attached patch adds the forbidden-apis checker to the tika-parent pom file with default configuration. Running it fails with many errors in TIKA core already: {noformat} [INFO] --- forbiddenapis:1.6.1:check (default) @ tika-core --- [INFO] Scanning for classes to check... [INFO] Reading bundled API signatures: jdk-unsafe [INFO] Reading bundled API signatures: jdk-deprecated [INFO] Loading classes to check... [INFO] Scanning for API signatures and dependencies... 
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.language.LanguageProfilerBuilder (LanguageProfilerBuilder.java:407)
[ERROR] Forbidden method invocation: java.lang.String#toUpperCase() [Uses default locale]
[ERROR]   in org.apache.tika.io.FilenameUtils (FilenameUtils.java:68)
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:257)
[ERROR] Forbidden method invocation: java.lang.String#<init>(byte[]) [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:395)
[ERROR] Forbidden method invocation: java.lang.String#<init>(byte[]) [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:416)
[ERROR] Forbidden method invocation: java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:438)
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:532)
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:550)
[ERROR] Forbidden method invocation: java.lang.String#<init>(byte[]) [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:588)
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:656)
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:782)
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:851)
[ERROR] Forbidden method invocation: java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:957)
[ERROR] Forbidden method invocation: java.io.OutputStreamWriter#<init>(java.io.OutputStream) [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:1064)
[ERROR] Forbidden method invocation: java.io.OutputStreamWriter#<init>(java.io.OutputStream) [Uses default charset]
[ERROR]   in org.apache.tika.sax.WriteOutContentHandler (WriteOutContentHandler.java:93)
[ERROR] Forbidden method invocation: java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default charset]
[ERROR]   in org.apache.tika.parser.external.ExternalParser (ExternalParser.java:234)
[ERROR] Forbidden method invocation: java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default
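The charset constant Uwe proposes above can be sketched as follows (a minimal illustration, not Tika's actual IOUtils; class and method names here are hypothetical). The point is to resolve the Charset object once and pass it explicitly, instead of relying on the platform default or repeating a String-name lookup:

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class CharsetConstants {

    // The proposed constant: resolve the charset object once, instead of
    // a synchronized name lookup on every conversion. (Java 7+ code would
    // use java.nio.charset.StandardCharsets.UTF_8 instead.)
    public static final Charset UTF_8 = Charset.forName("UTF-8");

    // Pass the Charset explicitly so the platform default is never used.
    public static Reader utf8Reader(InputStream in) {
        return new InputStreamReader(in, UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = "héllo".getBytes(UTF_8);        // encode explicitly
        String roundTrip = new String(bytes, UTF_8);   // decode explicitly
        System.out.println(roundTrip.equals("héllo")); // prints "true"
    }
}
```

With an explicit Charset, encode/decode round-trips are identical on every platform, which is exactly what the forbidden-apis "Uses default charset" signatures are guarding against.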
[jira] [Commented] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095853#comment-14095853 ]

Uwe Schindler commented on TIKA-1387:
-
Nick: in ImageMetadataExtractor.java, the date format is static, so it does not help that a new instance is created. If you remove the static, it should be fine.
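The static-date-format problem can be illustrated like this (an illustrative class, not Tika's actual ImageMetadataExtractor): SimpleDateFormat is not thread-safe, and a static field means one formatter is shared by all threads even though each caller creates a new instance of the enclosing class. Dropping static gives every instance its own formatter:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class DateFormatSafety {

    // Unsafe variant (what a static field amounts to): one mutable
    // SimpleDateFormat shared across all threads.
    // private static final SimpleDateFormat FMT = ...;

    // Safe: a formatter per instance, with explicit locale and time zone.
    private final SimpleDateFormat fmt;

    public DateFormatSafety() {
        fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    }

    // synchronized in case one instance is still shared between threads.
    public synchronized String format(Date date) {
        return fmt.format(date);
    }

    public static void main(String[] args) {
        // Epoch zero rendered in UTC.
        System.out.println(new DateFormatSafety().format(new Date(0L))); // prints "1970-01-01T00:00:00"
    }
}
```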
[jira] [Created] (TIKA-1387) Add forbidden-apis checker to TIKA build
Uwe Schindler created TIKA-1387:
-
Summary: Add forbidden-apis checker to TIKA build
Key: TIKA-1387
URL: https://issues.apache.org/jira/browse/TIKA-1387
Project: Tika
Issue Type: Improvement
Components: general
Reporter: Uwe Schindler
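For reference, the plugin registration added to tika-parent would look roughly like this. This is a sketch reconstructed from the plugin name, version, and bundled signatures visible in the build log; the actual patch may differ:

```xml
<plugin>
  <groupId>de.thetaphi</groupId>
  <artifactId>forbiddenapis</artifactId>
  <version>1.6.1</version>
  <configuration>
    <bundledSignatures>
      <bundledSignature>jdk-unsafe</bundledSignature>
      <bundledSignature>jdk-deprecated</bundledSignature>
    </bundledSignatures>
  </configuration>
  <executions>
    <execution>
      <goals>
        <goal>check</goal>
        <goal>testCheck</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```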
[jira] [Created] (TIKA-1386) Add forbidden-apis checker to TIKA build
Uwe Schindler created TIKA-1386:
-
Summary: Add forbidden-apis checker to TIKA build
Key: TIKA-1386
URL: https://issues.apache.org/jira/browse/TIKA-1386
Project: Tika
Issue Type: Improvement
Components: general
Reporter: Uwe Schindler
[jira] [Closed] (TIKA-1386) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler closed TIKA-1386.
-
Resolution: Duplicate

JIRA hung and created the issue two times.
[jira] [Updated] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated TIKA-1387:
Attachment: TIKA-1387.patch

This patch refactors the tika-java7 module a bit so that the forbidden-apis checker also uses the correct (Java 7) signatures. This was done by redefining the parent-pom properties instead of duplicating the compiler and forbidden-apis plugins.
[jira] [Commented] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087489#comment-14087489 ]

Uwe Schindler commented on TIKA-1387:
-
One suggestion: the official names of the source/target properties are maven.compile*r*.source and maven.compile*r*.target. I would suggest changing to those. Once that is done, you can remove the explicit declarations in the plugin configuration, because both the maven-compiler-plugin and the forbiddenapis plugin read those properties.
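With the renamed properties, tika-parent would carry something like the following (the 1.6 value is illustrative, not taken from the patch). Both maven-compiler-plugin and forbiddenapis pick these properties up automatically, so no per-plugin <source>/<target> configuration is needed:

```xml
<properties>
  <!-- Standard Maven property names, read by maven-compiler-plugin
       and by the forbiddenapis plugin -->
  <maven.compiler.source>1.6</maven.compiler.source>
  <maven.compiler.target>1.6</maven.compiler.target>
</properties>
```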
[jira] [Updated] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1387: Attachment: TIKA-1387.patch Patch with renamed properties to conform to Maven standards. Add forbidden-apis checker to TIKA build Key: TIKA-1387 URL: https://issues.apache.org/jira/browse/TIKA-1387 Project: Tika Issue Type: Improvement Components: general Reporter: Uwe Schindler Attachments: TIKA-1387.patch, TIKA-1387.patch, TIKA-1387.patch Lucene and many other projects already use the forbidden-apis checker to prevent use of some broken classes/signatures from the JDK. These are especially thing using default character sets or default locales. The forbidden-api checker can also be used to explcitely disallow specific methods, if they have security issues (e.g., creating XML parsers without disabling external entity support). The attached patch adds the forbidden-api checker to the tika-parent pom file with default configuration. Running it fails with many errors in TIKA core already: {noformat} [INFO] --- forbiddenapis:1.6.1:check (default) @ tika-core --- [INFO] Scanning for classes to check... [INFO] Reading bundled API signatures: jdk-unsafe [INFO] Reading bundled API signatures: jdk-deprecated [INFO] Loading classes to check... [INFO] Scanning for API signatures and dependencies... 
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.language.LanguageProfilerBuilder (LanguageProfilerBuilder.java:407) [ERROR] Forbidden method invocation: java.lang.String#toUpperCase() [Uses default locale] [ERROR] in org.apache.tika.io.FilenameUtils (FilenameUtils.java:68) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:257) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:395) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:416) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:438) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:532) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:550) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:588) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:656) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:782) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:851) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:957) [ERROR] 
Forbidden method invocation: java.io.OutputStreamWriter#init(java.io.OutputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:1064) [ERROR] Forbidden method invocation: java.io.OutputStreamWriter#init(java.io.OutputStream) [Uses default charset] [ERROR] in org.apache.tika.sax.WriteOutContentHandler (WriteOutContentHandler.java:93) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.parser.external.ExternalParser (ExternalParser.java:234) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.parser.external.ExternalParser$3 (ExternalParser.java:294) [ERROR] Forbidden method invocation: java.util.Calendar#getInstance(java.util.Locale) [Uses default locale or time zone] [ERROR] in org.apache.tika.utils.DateUtils (DateUtils.java:83) [ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale] [ERROR] in org.apache.tika.utils.DateUtils (DateUtils.java:91) [ERROR] Forbidden method invocation: java.lang.String#toLowerCase() [Uses default locale] [ERROR] in org.apache.tika.detect.MagicDetector (MagicDetector.java:98) [ERROR]
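The fixes for the hits above all follow one pattern: pass an explicit charset or locale instead of relying on the platform default. A minimal sketch of the pattern (the helper class and method names are hypothetical and not part of the attached patch):

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper illustrating the forbidden-apis fixes: every call site
// passes an explicit charset/locale instead of the platform default.
public class SafeApis {

    // Replaces String#getBytes() [uses default charset].
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Replaces String#toUpperCase() [uses default locale]; Locale.ROOT is
    // the right choice for protocol strings and identifiers, not user text.
    static String upperAscii(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    // Replaces String#format(String, Object...) [uses default locale].
    static String formatIso(String pattern, Object... args) {
        return String.format(Locale.ROOT, pattern, args);
    }

    public static void main(String[] args) {
        System.out.println(upperAscii("content-type"));   // CONTENT-TYPE
        System.out.println(formatIso("%02d:%02d", 7, 5)); // 07:05
    }
}
```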
[jira] [Commented] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088088#comment-14088088 ] Uwe Schindler commented on TIKA-1387: - Hi, I left a comment in the review; I was out for dinner. I would fix the issues in a different way at some places, especially String#toLowerCase(Locale.getDefault()), which has crazy effects in some languages (in Turkish, not even ASCII lower-cases as expected). Add forbidden-apis checker to TIKA build Key: TIKA-1387 URL: https://issues.apache.org/jira/browse/TIKA-1387 Project: Tika Issue Type: Improvement Components: general Reporter: Uwe Schindler Assignee: Tyler Palsulich Fix For: 1.7 Attachments: TIKA-1387.palsulich.080614.patch, TIKA-1387.patch, TIKA-1387.patch, TIKA-1387.patch
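The Turkish effect mentioned above is easy to reproduce with a standalone sketch: under the tr-TR locale, lower-casing ASCII "I" yields dotless i (U+0131), so even pure-ASCII strings do not lower-case as expected unless an explicit locale such as Locale.ROOT is used.

```java
import java.util.Locale;

// Demo of the Turkish-locale pitfall: "I".toLowerCase() under tr-TR produces
// dotless i (U+0131) instead of "i", breaking ASCII-only comparisons.
public class TurkishCaseDemo {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "TR");
        String withTurkishRules = "TITLE".toLowerCase(turkish); // "t\u0131tle"
        String withRoot = "TITLE".toLowerCase(Locale.ROOT);     // "title"
        System.out.println(withRoot.equals(withTurkishRules));  // false
    }
}
```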
[jira] [Reopened] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reopened TIKA-1387: - I disagree with some fixes, because they just work around the forbidden-apis checks by still using system defaults. Add forbidden-apis checker to TIKA build Key: TIKA-1387 URL: https://issues.apache.org/jira/browse/TIKA-1387 Project: Tika Issue Type: Improvement Components: general Reporter: Uwe Schindler Assignee: Tyler Palsulich Fix For: 1.7 Attachments: TIKA-1387.palsulich.080614.patch, TIKA-1387.patch, TIKA-1387.patch, TIKA-1387.patch
[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918634#comment-13918634 ] Uwe Schindler commented on TIKA-1252: - This could be a problem in Solr's DataImportHandler. I am not 100% sure if this one supports multiple values per key; maybe it is using a Map... In any case, if this is caused by Solr, I will move the issue over to SOLR. Tika is not indexing all authors of a PDF - Key: TIKA-1252 URL: https://issues.apache.org/jira/browse/TIKA-1252 Project: Tika Issue Type: Bug Components: metadata, parser Affects Versions: 1.4 Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, Bitnami Stack) Reporter: Alexandre Madurell When submitting a PDF with this information in its XMP metadata: ... <dc:creator> <rdf:Bag> <rdf:li>Author 1</rdf:li> <rdf:li>Author 2</rdf:li> </rdf:Bag> </dc:creator> ... Only the first one appears in the collection: ... author:[Author 1], author_s:Author 1, ... In spite of having set the field to multiValued in the Solr schema: <field name="author" type="text_general" indexed="true" stored="true" multiValued="true"/> Let me know if there's any further specific information I could provide. Thanks in advance! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918643#comment-13918643 ] Uwe Schindler commented on TIKA-1252: - I did a quick check in [https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java] Solr does not seem to remove duplicate values (see {{addMetadata()}} and {{addField(String fname, String fval, String[] vals)}}). Furthermore, if the field is *not* multiValued, the data is concatenated with whitespace and put into *one* field (see line 226 ff.). So this looks like a configuration problem or really a bug in TIKA. Tika is not indexing all authors of a PDF - Key: TIKA-1252 URL: https://issues.apache.org/jira/browse/TIKA-1252 Project: Tika Issue Type: Bug Components: metadata, parser Affects Versions: 1.4 Reporter: Alexandre Madurell
[jira] [Comment Edited] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918643#comment-13918643 ] Uwe Schindler edited comment on TIKA-1252 at 3/3/14 10:17 PM: -- I did a quick check in [https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java] Solr does not seem to remove duplicate keys (see {{addMetadata()}} and {{addField(String fname, String fval, String[] vals)}}). Furthermore, if the field is *not* multiValued, the data is concatenated with whitespace and put into *one* field (see line 226 ff.). So this looks like a configuration problem or really a bug in TIKA. was (Author: thetaphi): I did a quick check in [https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java] Solr does not seem to remove duplicate values (see {{addMetadata()}} and {{addField(String fname, String fval, String[] vals)}}). Furthermore, if the field is *not* multiValued, the data is concatenated with whitespace and put into *one* field (see line 226 ff.). So this looks like a configuration problem or really a bug in TIKA. Tika is not indexing all authors of a PDF - Key: TIKA-1252 URL: https://issues.apache.org/jira/browse/TIKA-1252 Project: Tika Issue Type: Bug Components: metadata, parser Affects Versions: 1.4 Reporter: Alexandre Madurell
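The single-field concatenation described in the comments above can be sketched generically (class and method names here are hypothetical illustrations; the real logic lives in Solr's SolrContentHandler around line 226 ff.): for a field that is not multiValued, all extracted metadata values are joined with whitespace into one field value rather than deduplicated or dropped.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the single-valued-field behavior: multiple metadata
// values are concatenated with a single space into one field value.
public class SingleValuedJoinDemo {

    static String joinForSingleValuedField(List<String> values) {
        return String.join(" ", values);
    }

    public static void main(String[] args) {
        List<String> authors = Arrays.asList("Author 1", "Author 2");
        System.out.println(joinForSingleValuedField(authors)); // Author 1 Author 2
    }
}
```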
[jira] [Created] (TIKA-1211) OpenDocument (ODF) parser produces multipe startDocument() events
Uwe Schindler created TIKA-1211: --- Summary: OpenDocument (ODF) parser produces multipe startDocument() events Key: TIKA-1211 URL: https://issues.apache.org/jira/browse/TIKA-1211 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Uwe Schindler Related to SOLR-4809: Solr receives multiple startDocument events when parsing OpenDocumentFiles. The parser already prevents multiple endDocuments, but not multiple startDocuments. The bug was introduced when we added parsing content.xml and meta.xml (TIKA-736, but both feed elements to the XHTML output, so we get multiple start/endDocuments). -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (TIKA-1211) OpenDocument (ODF) parser produces multiple startDocument() events
[ https://issues.apache.org/jira/browse/TIKA-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1211: Summary: OpenDocument (ODF) parser produces multiple startDocument() events (was: OpenDocument (ODF) parser produces multipe startDocument() events) OpenDocument (ODF) parser produces multiple startDocument() events -- Key: TIKA-1211 URL: https://issues.apache.org/jira/browse/TIKA-1211 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Uwe Schindler
[jira] [Commented] (TIKA-1211) OpenDocument (ODF) parser produces multiple startDocument() events
[ https://issues.apache.org/jira/browse/TIKA-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13850416#comment-13850416 ] Uwe Schindler commented on TIKA-1211: - There are multiple ways to fix this: - Make XHTMLContentHandler prevent multiple startDocument() events. I think that's the easiest and most correct fix; XHTMLContentHandler already has some magic in there. - Add an additional content handler that removes subsequent startDocuments (this is the same as above, just in a separate handler). OpenDocument (ODF) parser produces multiple startDocument() events -- Key: TIKA-1211 URL: https://issues.apache.org/jira/browse/TIKA-1211 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Uwe Schindler
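The second option above (a separate handler that swallows subsequent startDocument() events) could look roughly like this sketch built on the JDK's bundled SAX classes; the class name is hypothetical and this is not the actual Tika fix:

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical decorator: forwards only the first startDocument() to the
// wrapped ContentHandler and silently drops any repeats.
public class SingleStartDocumentFilter extends XMLFilterImpl {
    private boolean documentStarted = false;

    public SingleStartDocumentFilter(ContentHandler target) {
        setContentHandler(target);
    }

    @Override
    public void startDocument() throws SAXException {
        if (!documentStarted) {
            documentStarted = true;
            super.startDocument(); // forwarded exactly once
        }
    }

    public static void main(String[] args) throws SAXException {
        final int[] starts = {0};
        ContentHandler counter = new DefaultHandler() {
            @Override
            public void startDocument() { starts[0]++; }
        };
        SingleStartDocumentFilter filter = new SingleStartDocumentFilter(counter);
        filter.startDocument();
        filter.startDocument(); // duplicate, swallowed
        System.out.println(starts[0]); // 1
    }
}
```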
[jira] [Commented] (TIKA-1181) RTFParser not keeping HTML font colors and underscore tags.
[ https://issues.apache.org/jira/browse/TIKA-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13788171#comment-13788171 ] Uwe Schindler commented on TIKA-1181: - Other parsers like OpenOffice do not preserve colors either. RTFParser not keeping HTML font colors and underscore tags. --- Key: TIKA-1181 URL: https://issues.apache.org/jira/browse/TIKA-1181 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: Windows server 2008 Reporter: Leo Labels: RTFParser Hi, I'm having problems with this code. It does not put the font colors and underscore <u></u> tags in the HTML from the RTF string. Is there anything I can do to put them there? Code:
InputStream in = new ByteArrayInputStream(rtfString.getBytes("UTF-8"));
org.apache.tika.parser.rtf.RTFParser parser = new org.apache.tika.parser.rtf.RTFParser();
Metadata metadata = new Metadata();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));
parser.parse(in, handler, metadata, new ParseContext());
String xhtml = sw.toString();
xhtml = xhtml.replaceAll("\r\n", "<br/>\r\n");
Thanks for looking at it. Leo -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for br tags when parsing HTML
[ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734769#comment-13734769 ] Uwe Schindler commented on TIKA-1134: - Hoss: I agree to fix this in the documentation. On SOLR-4679 I explained in more detail *why TIKA is doing this*: {quote} Let me recapitulate TIKA's problems: - TIKA decided to use XHTML as its output format to report the parsed documents to the consumer. This is nice, because it allows preserving some of the formatting (like bold fonts, paragraphs, ...) originating from the original document. Of course most of this formatting is lost, but you can still detect things like emphasized text. By choosing XHTML as the output format, TIKA must of course use XHTML formatting for new lines and similar. So whenever a line break is needed, the TIKA parser emits a <br/> tag or places the paragraph (in a PDF) inside a <p> element. As we all know, HTML ignores formatting like newlines, tabs, ... (all are treated as one single whitespace, like the regex replace {{s/\s+/ /}}). - On the other hand, TIKA wants to make it simple for people to extract the *plain text* contents. With the XHTML-only approach this would be hard for the consumer, because to add the correct newlines, the consumer has to fully understand XHTML, detect block elements, and replace them by \n. To support both usages of TIKA, the idea was to embed this information, which is unimportant to HTML (as HTML ignores whitespace completely), as ignorableWhitespace, as a convenience for the user. A fully compliant XHTML consumer would not parse the ignorable stuff; as it understands HTML, it would detect a <p> element as a block element and format the output. Solr unfortunately has a strange approach: It is mainly interested in the text-only contents, so ideally when consuming the HTML it could use {{WriteOutContentHandler(stringBuilder, new BodyContentHandler(parserContentHandler))}}.
In that case TIKA would do the right thing automatically: It would extract only text from the <body> element and would use the convenience whitespace to format the text in an ASCII-art-like way (using tabs, newlines, ...) :-) Solr has a hybrid approach: It collects everything into a content tag (which is similar to the above approach), but the bug is that, in contrast to TIKA's official WriteOutContentHandler, it does not use the ignorable whitespace inserted for convenience. In addition, TIKA also has a stack where it allows processing parts of the documents (like the <title> element or all <em> elements). In that case it has several StringBuilders in parallel that are populated with the contents. The problems exist here too, but cannot be solved by using ignorable whitespace: e.g., if one indexes only all <em> elements (which are inline HTML elements, not block elements), there is no whitespace, so all <em> elements would be glued together in the em field of your index... I just mention this; in my opinion the SolrContentHandler needs more work to correctly understand HTML and not just collect element names in a map! Now to your complaint: You proposed to report the newlines as real {{characters()}} events - but this is not the right thing to do here. As I said, HTML does not know these characters; they are ignored. The formatting is done by the element names (like p, div, table). So the helper whitespace for text-only consumers should be inserted as ignorableWhitespace only; if we added it to the real character data, we would report things that every HTML parser (like NekoHTML) would never report to the consumer. NekoHTML would also report this useless extra whitespace as ignorable. The convenience here is that TIKA's XHTMLContentHandler, used by all parsers, is configured to help the text-only user but not hurt the HTML-only user. This differentiation is done by reporting the HTML element names (p, div, table, th, td, tr, abbr, em, strong, ...)
but also reporting the ASCII-art text-only content like TABs inside tables, newlines after block elements, ... This is always done as ignorableWhitespace (for convenience); a real HTML parser must ignore it - and it's correct to do this. {quote} I think we should document this in the javadocs or the howto page, so implementors of ContentHandlers know what to do! ContentHandler gets ignorable whitespace for br tags when parsing HTML Key: TIKA-1134 URL: https://issues.apache.org/jira/browse/TIKA-1134 Project: Tika Issue Type: Bug Components: parser Reporter: Hoss Man Attachments: TIKA-1134.patch
[jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for br tags when parsing HTML
[ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733344#comment-13733344 ] Uwe Schindler commented on TIKA-1134: - Hi Hoss, the rule in TIKA is: - TIKA inserts ignorableWhitespace to support plain-text extraction on block elements and <br/> tags (which are also somehow empty block elements) - see TIKA-171. Nothing else will insert ignorableWhitespace into the content handler. This means consumers that are only interested in the *plain text* contents of parsed files should ignore all HTML syntax elements and just treat ignorableWhitespace as significant - this is what TextOnlyContentHandler does to extract text. This was decided in TIKA-171 a long time ago. If you are interested in *structured* HTML output, use the XHTML elements and ignore the whitespace. ContentHandler gets ignorable whitespace for br tags when parsing HTML Key: TIKA-1134 URL: https://issues.apache.org/jira/browse/TIKA-1134 Project: Tika Issue Type: Bug Components: parser Reporter: Hoss Man Attachments: TIKA-1134.patch I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding something here, but it appears that the way Tika parses HTML to produce XHTML SAX events is misinterpreting <br> tags as equivalent to ignorable whitespace containing a newline. This means that clients who ask Tika to parse files, and specify their own ContentHandler to capture the character data, can get sequences of run-on text w/o knowing that the <br> tag was present -- _unless_ they explicitly handle ignorableWhitespace and treat it as real whitespace -- but this creates a catch-22 if you really do want to ignore the ignorable whitespace in the HTML markup. The crux of the problem seems to be: * instead of generating a startElement event for <br>, the HtmlParser treats it as a xhtml.newline().
* xhtml.newline() generates an ignorableWhitespace SAX event instead of a characters SAX event ...either one of these by themselves might be fine, but in combination they don't really make any sense. If for example an actual newline exists in the html, it comes across as part of a characters SAX event, not as ignorable whitespace. Changing the newline() function to delegate to characters(...) seems to solve the problem for <br> tags in HTML, but breaks several tests -- probably because the newline() function is also used to intentionally add (synthetic) ignorableWhitespace events after elements. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
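The text-only consumption rule described above can be sketched with a plain SAX handler that, unlike Solr's, treats ignorableWhitespace as significant. The class name is hypothetical; Tika's real implementations of this idea are WriteOutContentHandler and BodyContentHandler.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a plain-text consumer: it keeps the synthetic
// ignorableWhitespace (newlines/tabs) that Tika emits for block elements
// and <br/> tags, instead of dropping it like an HTML-aware consumer would.
public class TextOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Significant for plain-text extraction: this is where Tika's
        // convenience newlines and tabs arrive.
        text.append(ch, start, length);
    }

    public String getText() { return text.toString(); }

    public static void main(String[] args) {
        TextOnlyHandler h = new TextOnlyHandler();
        h.characters("line one".toCharArray(), 0, 8);
        h.ignorableWhitespace(new char[]{'\n'}, 0, 1); // what Tika emits for <br/>
        h.characters("line two".toCharArray(), 0, 8);
        System.out.println(h.getText()); // prints "line one" and "line two" on separate lines
    }
}
```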
[jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for br tags when parsing HTML
[ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733348#comment-13733348 ] Uwe Schindler commented on TIKA-1134: - I think this issue is Won't Fix. The issues described by Hoss are caused by user error :-) So maybe keep this open to update the javadocs inside all those wrapper ContentHandlers like BodyContentHandler to explicitly state that they extract plain text and add extra whitespace to support this. ContentHandler gets ignorable whitespace for br tags when parsing HTML Key: TIKA-1134 URL: https://issues.apache.org/jira/browse/TIKA-1134 Project: Tika Issue Type: Bug Components: parser Reporter: Hoss Man Attachments: TIKA-1134.patch
[jira] [Commented] (TIKA-1145) classloaders issue loading resources when extending Tika
[ https://issues.apache.org/jira/browse/TIKA-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13699793#comment-13699793 ] Uwe Schindler commented on TIKA-1145: - I think the main problem is ServiceLoader's definition. It uses the context class loader to load SPIs, which is in my opinion a bug in the spec. In Lucene we had the same problems with our own ServiceLoader impl, which uses the abstract class's/interface's classloader to load its own implementations. See LUCENE-4713 for more info, where Lucene uses SPI to load its codecs and analyzers. classloaders issue loading resources when extending Tika Key: TIKA-1145 URL: https://issues.apache.org/jira/browse/TIKA-1145 Project: Tika Issue Type: Bug Components: config, mime Affects Versions: 1.3 Environment: Tika as part of standard Solr distribution Reporter: Maciej Lizewski I noticed that ServiceLoader is using a different classloader when loading 'services' like Parsers, etc. (java.net.FactoryURLClassLoader) than MimeTypesFactory (org.eclipse.jetty.webapp.WebAppClassLoader) when loading mime type definitions. As a result, it works completely differently: When a jar with a custom parser and custom-mimetypes.xml is added to solr.war - both resources are located and loaded (META-INF\services\org.apache.tika.parser.Parser and org\apache\tika\mime\custom-mimetypes.xml) and everything works fine. When the jar with the custom parser is in the Solr core lib and configured in solrconfig.xml - only META-INF\services\org.apache.tika.parser.Parser is loaded, but custom-mimetypes.xml is ignored.
[jira] [Commented] (TIKA-1145) classloaders issue loading resources when extending Tika
[ https://issues.apache.org/jira/browse/TIKA-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699883#comment-13699883 ] Uwe Schindler commented on TIKA-1145: - OK, I misunderstood the original problem. If you pass the correct config's class loader everywhere TIKA uses ServiceLoader or otherwise looks up resources, it should be fine.
> MimeTypesFactory ignores the custom classLoader provided in TikaConfig and always uses only the context-provided one: ClassLoader cl = MimeTypesReader.class.getClassLoader();
[jira] [Commented] (TIKA-1145) classloaders issue loading resources when extending Tika
[ https://issues.apache.org/jira/browse/TIKA-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699888#comment-13699888 ] Uwe Schindler commented on TIKA-1145: - It is still strange that you see this behaviour: if the JAR files of DIH, TIKA's JAR files, and your custom parsers are all in the SolrCore's lib folder, they all share the same classloader (the SolrCore's ResourceLoader's classloader). Problems would only exist if the TIKA and DIH classes are in the WAR file but the custom parser is in the lib or conf dir of the Solr core. In that case MimeTypesFactory only loads classes from its own class loader (which is the webapp's), not through the Solr ResourceLoader. In any case, MimeTypesFactory should use the configured classloader.
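The fix being asked for boils down to a lookup pattern like the following minimal sketch (ResourceLookup and its method are illustrative names, not Tika's actual API): prefer the explicitly configured classloader, and fall back to the defining class's own loader only when none was supplied.

```java
import java.io.InputStream;

// Minimal sketch (illustrative names, not Tika's actual API) of the fix:
// resource lookup should prefer an explicitly configured classloader and
// only fall back to the defining class's own loader, instead of
// unconditionally using MimeTypesReader.class.getClassLoader() as the
// quoted line does.
public class ResourceLookup {
    public static InputStream open(ClassLoader configured, String name) {
        ClassLoader cl = (configured != null)
                ? configured
                : ResourceLookup.class.getClassLoader();  // fallback only
        return cl.getResourceAsStream(name);
    }
}
```

With this pattern, a container like Solr could hand its ResourceLoader's classloader down through the config, and custom-mimetypes.xml in a core's lib dir would be found the same way the parser service file is.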