[jira] [Commented] (TIKA-2743) Replace com.sun.xml.bind:jaxb-impl and jaxb-core by org.glassfish.jaxb:jaxb-runtime and jaxb-core
[ https://issues.apache.org/jira/browse/TIKA-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681602#comment-16681602 ] Uwe Schindler commented on TIKA-2743:

bq. Tim Allison: shouldn't jaxb-runtime have runtime, rather than compile scope?

If we don't need runtime details, yes. But weren't we talking about a direct dependency on the "com.sun" classes, which are now in the glassfish namespace? If we require those at compile time, it must be a compile dependency.

bq. License should work? CDDL 1.1

The CDDL license is fine. But the license and copyright must be mentioned in the NOTICE file! See the Apache License guidelines.

> Replace com.sun.xml.bind:jaxb-impl and jaxb-core by
> org.glassfish.jaxb:jaxb-runtime and jaxb-core
> --------------------------------------------------
>
> Key: TIKA-2743
> URL: https://issues.apache.org/jira/browse/TIKA-2743
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.19
> Reporter: Thomas Mortagne
> Assignee: Tim Allison
> Priority: Major
> Fix For: 2.0.0, 1.19.1
>
> com.sun.xml.bind:* is actually the old name and is currently a repackaging of
> org.glassfish.jaxb:*. Probably kept for backward compatibility.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
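If the glassfish classes turn out not to be referenced at compile time, the dependency could be declared with runtime scope. A hypothetical pom.xml fragment (version number is only an example from the 2.x line):

```xml
<!-- hypothetical fragment: the scope depends on whether com.sun/glassfish
     classes are imported directly in the source -->
<dependency>
  <groupId>org.glassfish.jaxb</groupId>
  <artifactId>jaxb-runtime</artifactId>
  <version>2.3.1</version>
  <!-- use the default compile scope instead if classes are referenced at compile time -->
  <scope>runtime</scope>
</dependency>
```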
[jira] [Commented] (TIKA-2722) Don't call Date.toString (Possible issue with JDK 11)
[ https://issues.apache.org/jira/browse/TIKA-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604643#comment-16604643 ] Uwe Schindler commented on TIKA-2722:

bq. I reported it to Oracle using their normal channel for reporting bugs.

Once you get the internal ID, send it to Rory; that helps to speed things up, especially as this is shortly before the release. IMHO that's a real bug and should be fixed before release! Not sure about their priority internals :-)

> Don't call Date.toString (Possible issue with JDK 11)
> -----------------------------------------------------
>
> Key: TIKA-2722
> URL: https://issues.apache.org/jira/browse/TIKA-2722
> Project: Tika
> Issue Type: Bug
> Environment: Tika 1.18, JDK 11 with locale set to "ar-EG".
> Reporter: David Smiley
> Priority: Major
>
> I'm troubleshooting [a test failure in Apache
> Lucene/Solr|https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/22799/]
> "extracting" contrib that occurs in JDK 11 with locale "ar-EG". JDK 8 & 9
> pass; I don't know about JDK 10. It has to do with extracting date metadata
> from a PDF, particularly the created date but perhaps others too.
> I stepped through the code into Tika and I think I've found where the
> troublesome code is. First note PDFParser line 271: {{addMetadata(metadata,
> "created", info.getCreationDate());}}. That addMetadata overload variant
> will call toString on a Date. IMO that's asking for trouble since the output
> of that is Locale-dependent. I think that's okay to show to a user but not
> for machine-to-machine information exchange. In the case of the test, it
> yielded this odd-looking date string:
> Thu Nov 13 18:35:51 GMT+٠٥:٠٠ 2008
> I pasted that in and it looks consistent with what I see in IntelliJ and in
> Jenkins logs; hopefully it will post correctly to JIRA. The odd part is the
> hours & minutes relative to GMT. I won't be certain until after I click
> "Create".
> Perhaps this problem is also indicative of a JDK 11 bug? Nevertheless I
> think Tika should avoid calling Date.toString().

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
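The point about locale-dependent output can be sketched like this (hypothetical class name and timestamp, not Tika code): Date.toString() varies with the default locale and timezone, while formatting via Instant yields locale-independent ISO-8601 suitable for machine-to-machine metadata exchange.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.util.Date;

public class IsoDateDemo {
    public static void main(String[] args) {
        Date created = Date.from(Instant.parse("2008-11-13T13:35:51Z"));
        // Locale- and zone-dependent: digits and timezone rendering can vary
        System.out.println(created);
        // Locale-independent ISO-8601: always the same string
        System.out.println(DateTimeFormatter.ISO_INSTANT.format(created.toInstant()));
        // prints 2008-11-13T13:35:51Z
    }
}
```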
[jira] [Commented] (TIKA-2722) Don't call Date.toString (Possible issue with JDK 11)
[ https://issues.apache.org/jira/browse/TIKA-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604641#comment-16604641 ] Uwe Schindler commented on TIKA-2722:

Cool, thanks for the reproducer. That's indeed a bug, as you explicitly set the locale on the call to {{getDisplayName()}}. It still uses the default timezone to return the value. BUG!

> Don't call Date.toString (Possible issue with JDK 11)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
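A minimal sketch of such a reproducer (an assumed setup mirroring the ar-EG environment from the report, not the original test code): the locale is passed explicitly, so the GMT offset digits should be ASCII regardless of the default locale.

```java
import java.util.Locale;
import java.util.TimeZone;

public class TzDisplayNameDemo {
    public static void main(String[] args) {
        // Default locale uses non-ASCII (Arabic-Indic) digits
        Locale.setDefault(Locale.forLanguageTag("ar-EG"));
        TimeZone tz = TimeZone.getTimeZone("GMT+05:00");
        // Locale is given explicitly, so the offset should render with ASCII digits
        String name = tz.getDisplayName(false, TimeZone.SHORT, Locale.ROOT);
        System.out.println(name);
    }
}
```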
[jira] [Commented] (TIKA-2722) Don't call Date.toString (Possible issue with JDK 11)
[ https://issues.apache.org/jira/browse/TIKA-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604516#comment-16604516 ] Uwe Schindler commented on TIKA-2722:

[~dsmiley]: I think this is a bug in Java 11. I know there were some changes to the formatting of time zones. According to their docs, time zones are now printed according to the selected locale, or the default one if none is given. This is fine in most cases, but it seems to affect locales where the digits are different (non-ASCII). Previously, time zones that have no name (numeric only) seem to have been printed with ASCII digits. Nevertheless, only the timezone is printed with locale-dependent digits, not the date itself (reason: no date formatter is used; toString just concatenates integers to format the date, for compatibility reasons).

Did you send Rory O'Donnell a note? He can speed up assigning the JDK issue ID.

IMHO: TIKA should stop using java.util.Date and should go for the java.time APIs, maybe starting with Instant instead of Date.

> Don't call Date.toString (Possible issue with JDK 11)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Comment Edited] (TIKA-2667) Upgrade jmatio to 1.4
[ https://issues.apache.org/jira/browse/TIKA-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518490#comment-16518490 ] Uwe Schindler edited comment on TIKA-2667 at 6/20/18 7:04 PM:

It's OK because it won't fail, but I don't understand the need to catch Throwable and the reason to use AtomicReference. The doPrivileged part cannot throw any exception; it will always succeed, all exceptions are handled internally!

doPrivileged is not risky, as it does not do something like "sudo" (the name of the method is misleading). It just executes the stuff inside the lambda with the privileges of the current code base (and that's documented to be always possible, because we call the method ourselves). Without the doPrivileged it would call the stuff with the caller's privileges. The doPrivileged call is there to allow users of the JAR to configure the JVM so that only our JAR file can do the privileged action. This improves security, because you don't need to give everyone the permission to call setAccessible() and access Unsafe. It's only important to NOT give the MethodHandle to untrusted code, so it must be "private final".

So just copy-paste the whole code from Lucene's MMapDirectory - the static initializer (maybe code the error reporting a bit differently; Lucene uses no logging): https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L312-L336

The return value of the method reference to the private unmapper method is Object, to allow passing either a String or a MethodHandle through the return value of the privileged block. The code you added using the AtomicReference is not needed - exactly because the privileged code returns either a method handle OR an error message (that's the trick). The resourceDescription is used to make the exception more meaningful (in Lucene we use the filename, so the user gets an error about which file handle caused the issue).

This try-catch in the code is obsolete: https://github.com/tballison/jmatio/blob/master/src/main/java/com/jmatio/io/MatFileReader.java#L376-L395 The BufferCleaner interface just throws an IOException if unmapping goes wrong - with a meaningful error message. So I'd remove the try-catch block, it's legacy.

Maybe I should create a Pull Request? Unfortunately I have no time and no checkout of the matfile reader ready.

> Upgrade jmatio to 1.4
> ---------------------
>
> Key: TIKA-2667
> URL: https://issues.apache.org/jira/browse/TIKA-2667
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>
> jmatio 1.3 includes an upgrade to clean MappedByteBuffers in Java 8->11-ea,
> thanks to a copy/paste from Lucene.
>
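The "MethodHandle OR error message" trick can be condensed into a sketch like the following (an illustration of the pattern under discussion, not the verbatim Lucene or jmatio code; class and field names are made up). The privileged block never throws: it returns either a MethodHandle on success or a String error message on failure, so no AtomicReference and no catch of Throwable is needed in the initializer.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.security.AccessController;
import java.security.PrivilegedAction;

final class UnmapSupport {
    // In real code these should be private final, so the handle
    // cannot leak to untrusted code:
    static final MethodHandle UNMAPPER;       // non-null when unmapping works
    static final String NOT_SUPPORTED_REASON; // non-null when it is disabled

    static {
        // doPrivileged wraps only the setup; the lambda returns Object so it
        // can carry either a MethodHandle or a String error message
        Object hack = AccessController.doPrivileged(
                (PrivilegedAction<Object>) UnmapSupport::unmapHackImpl);
        if (hack instanceof MethodHandle) {
            UNMAPPER = (MethodHandle) hack;
            NOT_SUPPORTED_REASON = null;
        } else {
            UNMAPPER = null;
            NOT_SUPPORTED_REASON = hack.toString();
        }
    }

    private static Object unmapHackImpl() {
        try {
            // Java 9+: sun.misc.Unsafe.invokeCleaner(ByteBuffer)
            Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
            MethodHandle cleaner = MethodHandles.lookup().findVirtual(
                    unsafeClass, "invokeCleaner",
                    MethodType.methodType(void.class, ByteBuffer.class));
            Field f = unsafeClass.getDeclaredField("theUnsafe");
            f.setAccessible(true); // done early, inside doPrivileged
            return cleaner.bindTo(f.get(null));
        } catch (SecurityException se) {
            return "Unmapping not supported: security policy denies setAccessible(): " + se;
        } catch (ReflectiveOperationException | RuntimeException e) {
            return "Unmapping is not supported on this platform: " + e;
        }
    }
}
```

Because every failure path returns an error String instead of throwing, the class always loads, even on a JDK where the reflective hack no longer works.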
[jira] [Commented] (TIKA-2667) Upgrade jmatio to 1.4
[ https://issues.apache.org/jira/browse/TIKA-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518490#comment-16518490 ] Uwe Schindler commented on TIKA-2667:

It's OK because it won't fail, but I don't understand the need to catch Throwable and the reason to use AtomicReference. The doPrivileged part cannot throw any exception; it will always succeed, all exceptions are handled internally!

doPrivileged is not risky, as it does not do something like "sudo" (the name of the method is misleading). It just executes the stuff inside the lambda with the privileges of the current code base (and that's always possible, because we call the method). Without the doPrivileged it would call the stuff with the caller's privileges. The doPrivileged call is there to allow users of the JAR to configure the JVM so that only our JAR file can do the privileged action. This improves security, because you don't need to give everyone the permission to call setAccessible() and access Unsafe.

So just copy-paste the whole code from Lucene's MMapDirectory - the static initializer (maybe code the error reporting a bit differently; Lucene uses no logging): https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L312-L336

The return value of the method reference to the private unmapper method is Object, to allow passing either a String or a MethodHandle through the return value of the privileged block. The code you added using the AtomicReference is not needed - exactly because the privileged code returns either a method handle OR an error message (that's the trick). The resourceDescription is used to make the exception more meaningful (in Lucene we use the filename, so the user gets an error about which file handle caused the issue).

This code is obsolete: https://github.com/tballison/jmatio/blob/master/src/main/java/com/jmatio/io/MatFileReader.java#L376-L395 The BufferCleaner interface just throws an IOException if unmapping goes wrong - with a meaningful error message. So I'd remove the try-catch block, it's legacy.

Maybe I should create a Pull Request? Unfortunately I have no time and no checkout of the matfile reader ready.

> Upgrade jmatio to 1.4
> ---------------------
>
> Key: TIKA-2667
> URL: https://issues.apache.org/jira/browse/TIKA-2667
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>
> jmatio 1.3 includes an upgrade to clean MappedByteBuffers in Java 8->11-ea,
> thanks to a copy/paste from Lucene.
> jmatio 1.4 will include one that actually works. Thank you, [~thetaphi]!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
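The shape of the BufferCleaner interface described here could look like the following sketch (an illustration, not the actual jmatio source; names are assumptions). Failure is reported through an IOException carrying the resource description, so the call site needs no extra try-catch around the unmap call.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

// Failure is signaled via IOException with a meaningful message,
// so callers do not wrap the call in their own try-catch.
interface BufferCleaner {
    void freeBuffer(String resourceDescription, ByteBuffer buffer) throws IOException;
}

public class BufferCleanerDemo {
    public static void main(String[] args) throws IOException {
        // A no-op stand-in; a real implementation would invoke the unmapper MethodHandle
        BufferCleaner cleaner = (desc, buf) -> {
            if (buf == null) {
                throw new IOException("Unable to unmap " + desc);
            }
        };
        cleaner.freeBuffer("file.mat", ByteBuffer.allocate(4)); // succeeds silently
        System.out.println("unmapped");
    }
}
```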
[jira] [Commented] (TIKA-2667) Upgrade jmatio to 1.3
[ https://issues.apache.org/jira/browse/TIKA-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512016#comment-16512016 ] Uwe Schindler commented on TIKA-2667:

Hi, I just looked at your code change in jmatio. The AccessController.doPrivileged() needs to be around the static initializer because the setAccessible(true) is now done there (early). When calling the "compiled" cleaner, it can then be sure that it works.

Do you think it is a good idea to throw a runtime exception in the initializer if it fails? This is too risky; what happens if somebody uses a too-new JDK? At the place where it actually calls the created cleaner instance, no doPrivileged is needed (it's already in the implementation, so it is done twice).

Should I open a bug on your fork?

Uwe

> Upgrade jmatio to 1.3
> ---------------------
>
> Key: TIKA-2667
> URL: https://issues.apache.org/jira/browse/TIKA-2667
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>
> jmatio 1.3 includes an upgrade to clean MappedByteBuffers in Java 8->11-ea,
> thanks to a copy/paste from Lucene.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096384#comment-15096384 ] Uwe Schindler commented on TIKA-1830:

It would be good to update to 1.8.11 as soon as it is out, because Lucene/Solr is affected by PDFBOX-3155: we are testing Java 9 preview builds, and that failed because of this bug. For now we disabled the tests around TIKA when running with Java 9.

> Upgrade to PDFBox 1.8.11 when available
> ---------------------------------------
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096663#comment-15096663 ] Uwe Schindler commented on TIKA-1830:

bq. Speaking of integration with Solr, would you have a chance/any interest in offering feedback on our initial restructuring of the parser bundles for Tika 2.0 (TIKA-1824)? Or more generally, do you and your Solr colleagues have any wishes for the 2.0 roadmap?

As already stated in the past, we would like to only bundle parsers for text document formats, because images, class files, and the like are not really useful for indexing by default. Users that want those can still add the missing parser bundles, and SPI will do the rest. Currently we have disabled some parsers by removing the JAR files (like asm-all.jar, netcdf.jar), so TIKA's SPI disables them automatically (because of ClassNotFoundException). This was a bit rude, but it worked. The reason for this was partly also some version incompatibilities (ASM was old in TIKA, Lucene needs the newest one), but ASM is not really useful for indexing anyway! In Solr we don't use transitive dependencies in Ivy, so we decide for each JAR file which one gets bundled; therefore we check every release anyway during updates.

> Upgrade to PDFBox 1.8.11 when available
> ---------------------------------------
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096668#comment-15096668 ] Uwe Schindler commented on TIKA-1824:

Hi, as invited on TIKA-1830, here are some comments from Apache Solr:

{quote}
As already stated in the past, we would like to only bundle parsers for text document formats, because images, class files, and the like are not really useful for indexing by default. Users that want those can still add the missing parser bundles, and SPI will do the rest. Currently we have disabled some parsers by removing the JAR files (like asm-all.jar, netcdf.jar), so TIKA's SPI disables them automatically (because of ClassNotFoundException). This was a bit rude, but it worked. The reason for this was partly also some version incompatibilities (ASM was old in TIKA, Lucene needs the newest one), but ASM is not really useful for indexing anyway! In Solr we don't use transitive dependencies in Ivy, so we decide for each JAR file which one gets bundled; therefore we check every release anyway during updates.
{quote}

In addition, it would be a good idea to allow loading the TIKA SPI files in a separate classloader (to isolate the parser classes from others). The reason for this is JAR hell. If TIKA loaded the parsers in its own classloader (optionally, e.g. by configuration), we could place all parsers and their dependencies in a separate lib directory outside Solr's lib folder.

> Tika 2.0 - Create Initial Parser Modules
> ----------------------------------------
>
> Key: TIKA-1824
> URL: https://issues.apache.org/jira/browse/TIKA-1824
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 2.0
> Reporter: Bob Paulin
> Assignee: Bob Paulin
>
> Create initial break down of parser modules.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
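The isolation idea could be sketched like this (an illustration only; Supplier stands in for Tika's Parser SPI, and the JAR list is a placeholder): parser JARs from a dedicated lib directory go into a child classloader, so their transitive dependencies (ASM, netcdf, ...) cannot clash with the host application's versions.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ServiceLoader;
import java.util.function.Supplier;

public class IsolatedSpiDemo {
    public static void main(String[] args) throws Exception {
        // Would point at the JARs in a separate lib directory
        URL[] parserJars = new URL[0];
        try (URLClassLoader parserLoader =
                new URLClassLoader(parserJars, IsolatedSpiDemo.class.getClassLoader())) {
            // ServiceLoader only discovers META-INF/services entries
            // visible to this classloader
            ServiceLoader<Supplier> parsers = ServiceLoader.load(Supplier.class, parserLoader);
            parsers.forEach(p -> System.out.println(p.getClass().getName()));
        }
    }
}
```

Classes loaded through parserLoader see the child JARs first for the parser dependencies, while the host's own classpath stays untouched.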
[jira] [Updated] (TIKA-1758) BatchCommandLineBuilder fails on systems with whitespace in path
[ https://issues.apache.org/jira/browse/TIKA-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1758:

Description:
All tests for the CLI module fail with errors like this:
{noformat}
Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.048 sec <<< FAILURE! - in org.apache.tika.cli.TikaCLIBatchCommandLineTest
testTwoDirsNoFlags(org.apache.tika.cli.TikaCLIBatchCommandLineTest)  Time elapsed: 0.026 sec  <<< ERROR!
java.nio.file.InvalidPathException: Illegal char <"> at index 0: "C:\Users\Uwe Schindler\Projects\TIKA\svn\tika-app\testInput"
        at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182)
        at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153)
        at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
        at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94)
        at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255)
        at java.nio.file.Paths.get(Paths.java:84)
        at org.apache.tika.cli.BatchCommandLineBuilder.translateCommandLine(BatchCommandLineBuilder.java:137)
        at org.apache.tika.cli.BatchCommandLineBuilder.build(BatchCommandLineBuilder.java:51)
        at org.apache.tika.cli.TikaCLIBatchCommandLineTest.testTwoDirsNoFlags(TikaCLIBatchCommandLineTest.java:127)
{noformat}
The reason is that BatchCommandLineBuilder adds quotes for unknown reasons!? If you use ProcessBuilder you don't need that! Not sure what this should do, but the problem is: the first argument (the executable) contains quotes after the method has transformed it, and that breaks the test. I have no idea how to fix this, but the quotes should not be in a String[] command line at all.

> BatchCommandLineBuilder fails on systems with whitespace in path
> ----------------------------------------------------------------
>
> Key: TIKA-1758
> URL: https://issues.apache.org/jira/browse/TIKA-1758
> Project: Tika
> Issue Type: Bug
> Components: cli
> Reporter: Uwe Schindler
>
> All tests for the CLI module fail with errors like this:
> {noformat}
> Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.048 sec <<< FAILURE! - in org.apache.tika.cli.TikaCLIBatchCommandLineTest
> testTwoDirsNoFlags(org.apache.tika.cli.TikaCLIBatchCommandLineTest)  Time elapsed: 0.026 sec  <<< ERROR!
> java.nio.file.InvalidPathException: Illegal char <"> at index 0: "C:\Users\Uwe Schindler\Projects\TIKA\svn\tika-app\testInput"
> at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182)
> at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153)
> at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
> at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94)
> at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255)
> at java.nio.file.Paths.get(Paths.java:84)
> at org.apache.tika.cli.BatchCommandLineBuilder.translateCommandLine(BatchCommandLineBuilder.java:137)
> at org.apache.tika.cli.BatchCommandLineBuilder.build(BatchCommandLineBuilder.java:51)
> at org.apache.tika.cli.TikaCLIBatchCommandLineTest.testTwoDirsNoFlags(TikaCLIBatchCommandLineTest.java:127)
> {noformat}
> The reason is that BatchCommandLineBuilder adds quotes for unknown
[jira] [Commented] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name
[ https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938915#comment-14938915 ] Uwe Schindler commented on TIKA-1757: - The other issue is different, I opened TIKA-1758 > tika-batch tests fail on systems with whitespace or special chars in folder > name > > > Key: TIKA-1757 > URL: https://issues.apache.org/jira/browse/TIKA-1757 > Project: Tika > Issue Type: Bug >Reporter: Uwe Schindler >Assignee: Tim Allison > Attachments: TIKA-1757.patch > > > This is one problem that forbiddenapis des not catch, because the method > affected has valid use cases: {{URL#getFile()}} or {{URL#getPath()}} both > return the URL path, which should never be treated as a file system path (for > file: URLs). This is breaks asap, if the path contains special characters > which may not be part of URL. getFile() and getPath() return the encoded path. > The correct way to transform a file URL to a file is: {{new > File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven > community for Mojos/Plugins. > In fact the affected test should not use a file at all. Instead it should use > {{Class#getResourceAsStream()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name
Uwe Schindler created TIKA-1757: --- Summary: tika-batch tests fail on systems with whitespace or special chars in folder name Key: TIKA-1757 URL: https://issues.apache.org/jira/browse/TIKA-1757 Project: Tika Issue Type: Bug Reporter: Uwe Schindler This is one problem that forbiddenapis does not catch, because the affected method has valid use cases: {{URL#getFile()}} and {{URL#getPath()}} both return the URL path, which should never be treated as a file system path (for file: URLs). This breaks as soon as the path contains special characters that may not be part of a URL: getFile() and getPath() return the encoded path. The correct way to transform a file URL to a file is {{new File(url.toURI())}}. See also the list of "bad stuff" compiled by the Maven community for Mojos/Plugins. In fact the affected test should not use a file at all; instead it should use {{Class#getResourceAsStream()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
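The difference is easy to demonstrate in a few lines of Java (a sketch with a made-up path; the class and variable names are illustrative, not from Tika):

```java
import java.io.File;
import java.net.URL;

public class UrlToFile {
    public static void main(String[] args) throws Exception {
        // a file: URL for a path containing a space (encoded as %20 in the URL)
        URL url = new File("/tmp/My Documents/test.txt").toURI().toURL();

        // BAD: getPath()/getFile() return the still percent-encoded URL path
        String bad = url.getPath();
        System.out.println(bad.contains("%20"));                 // true: "%20" leaks into the "path"

        // GOOD: round-tripping through URI decodes the path correctly
        File good = new File(url.toURI());
        System.out.println(good.getPath().contains("%20"));      // false: the real space is back
    }
}
```

The bad variant only appears to work as long as the path contains no characters that need URL encoding, which is exactly why such bugs surface on machines with whitespace in the user name.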
[jira] [Created] (TIKA-1756) Update forbiddenapis to v2.0
Uwe Schindler created TIKA-1756: --- Summary: Update forbiddenapis to v2.0 Key: TIKA-1756 URL: https://issues.apache.org/jira/browse/TIKA-1756 Project: Tika Issue Type: Improvement Reporter: Uwe Schindler Forbiddenapis 2.0 was released a few hours ago. Apache POI and Lucene already updated, Tika should do this, too. Attached is a patch. {quote} The main new feature is native support for the Gradle build system (minimum requirement is Gradle 2.3). But also Apache Ant and Apache Maven build systems got improved support: Ant can now load signatures from arbitrary resources by using a new XML element that may contain any valid ANT resource, e.g., ivy's cache-filesets or plain URLs. Apache Maven now supports to load signatures files as artifacts from your repository or Maven Central (new signaturesArtifacts Mojo property). Breaking changes: - Update to Java 6 as minimum requirement. - Switch default Maven lifecycle phase to verify. Bug fixes: - Add automatic plugin execution override for M2E. It is no longer needed to add a lifecycle mapping to exclude forbiddenapis to execute inside Eclipse's M2E {quote} The M2E change is nice, because you no longer need the M2E workaround to disable running the plugin in Eclipse manually. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1758) BatchCommandLineBuilder fails on systems with whitespace in path
Uwe Schindler created TIKA-1758: --- Summary: BatchCommandLineBuilder fails on systems with whitespace in path Key: TIKA-1758 URL: https://issues.apache.org/jira/browse/TIKA-1758 Project: Tika Issue Type: Bug Components: cli Reporter: Uwe Schindler All tests for the CLI module fail with errors like this: {noformat} Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.048 sec <<< FAILURE! - in org.apache.tika.cli.TikaCLIBatchCommandLineTest testTwoDirsNoFlags(org.apache.tika.cli.TikaCLIBatchCommandLineTest) Time elapsed: 0.026 sec <<< ERROR! java.nio.file.InvalidPathException: Illegal char <"> at index 0: "C:\Users\Uwe Schindler\Projects\TIKA\svn\tika-app\testInput" at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182) at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153) at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77) at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94) at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255) at java.nio.file.Paths.get(Paths.java:84) at org.apache.tika.cli.BatchCommandLineBuilder.translateCommandLine(BatchCommandLineBuilder.java:137) at org.apache.tika.cli.BatchCommandLineBuilder.build(BatchCommandLineBuilder.java:51) at org.apache.tika.cli.TikaCLIBatchCommandLineTest.testTwoDirsNoFlags(TikaCLIBatchCommandLineTest.java:127) {noformat} The reason is that BatchCommandLineBuilder adds quotes for unknown reasons!? If you use ProcessBuilder you don't need that! Not sure what this code is supposed to do, but the problem is: the first argument (the executable) contains quotes afterwards, which breaks the test. I have no idea how to fix this, but the quotes should not be in a String[] command line at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
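For background, ProcessBuilder hands each String[] element to the child process verbatim; no shell is involved, so arguments must never be pre-quoted. A minimal sketch (buildCommand, the flags, and the paths are hypothetical, not Tika's actual code):

```java
import java.util.Arrays;
import java.util.List;

public class CommandLineSketch {
    /** Builds an argument list; every element is exactly one argument, whitespace included. */
    static List<String> buildCommand(String executable, String inputDir, String outputDir) {
        // Arguments must NOT be wrapped in quote characters: ProcessBuilder does
        // no shell parsing, so the quotes would become part of the path itself.
        return Arrays.asList(executable, "-inputDir", inputDir, "-outputDir", outputDir);
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("java",
                "C:\\Users\\Some User\\testInput",    // whitespace is fine as-is
                "C:\\Users\\Some User\\testOutput");
        System.out.println(cmd);
        // new ProcessBuilder(cmd).start() would receive both paths intact
    }
}
```

Quoting is only needed when a command line is flattened into a single string for a shell; with a String[] the boundaries between arguments are already explicit.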
[jira] [Commented] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name
[ https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938906#comment-14938906 ] Uwe Schindler commented on TIKA-1757: - Please wait with committing there are more tests failing with similar problems: Now tika-app, in this case some unneeded quoting. > tika-batch tests fail on systems with whitespace or special chars in folder > name > > > Key: TIKA-1757 > URL: https://issues.apache.org/jira/browse/TIKA-1757 > Project: Tika > Issue Type: Bug >Reporter: Uwe Schindler >Assignee: Tim Allison > Attachments: TIKA-1757.patch > > > This is one problem that forbiddenapis des not catch, because the method > affected has valid use cases: {{URL#getFile()}} or {{URL#getPath()}} both > return the URL path, which should never be treated as a file system path (for > file: URLs). This is breaks asap, if the path contains special characters > which may not be part of URL. getFile() and getPath() return the encoded path. > The correct way to transform a file URL to a file is: {{new > File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven > community for Mojos/Plugins. > In fact the affected test should not use a file at all. Instead it should use > {{Class#getResourceAsStream()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1756) Update forbiddenapis to v2.0
[ https://issues.apache.org/jira/browse/TIKA-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1756: Attachment: TIKA-1756.patch > Update forbiddenapis to v2.0 > > > Key: TIKA-1756 > URL: https://issues.apache.org/jira/browse/TIKA-1756 > Project: Tika > Issue Type: Improvement >Reporter: Uwe Schindler > Attachments: TIKA-1756.patch > > > Forbiddenapis 2.0 was released a few hours ago. Apache POI and Lucene already > updated, Tika should do this, too. > Attached is a patch. > {quote} > The main new feature is native support for the Gradle build system (minimum > requirement is Gradle 2.3). But also Apache Ant and Apache Maven build > systems got improved support: Ant can now load signatures from arbitrary > resources by using a new XML element that may > contain any valid ANT resource, e.g., ivy's cache-filesets or plain URLs. > Apache Maven now supports to load signatures files as artifacts from your > repository or Maven Central (new signaturesArtifacts Mojo property). > Breaking changes: > - Update to Java 6 as minimum requirement. > - Switch default Maven lifecycle phase to verify. > Bug fixes: > - Add automatic plugin execution override for M2E. It is no longer needed to > add a lifecycle mapping to exclude forbiddenapis to execute inside Eclipse's > M2E > {quote} > The M2E change is nice, because you no longer need the M2E workaround to > disable running the plugin in Eclipse manually. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1756) Update forbiddenapis to v2.0
[ https://issues.apache.org/jira/browse/TIKA-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14938879#comment-14938879 ] Uwe Schindler commented on TIKA-1756: - While testing this I found out that Tika's tests break when running with whitespace in the folder name (Windows user name with whitespace). But this is unrelated to this issue. The problematic method is one of the crazy things that may be put on the forbidden list (Lucene does this): {{URL#getPath()}} is bad, bad, bad if used to generate a file name. It must be {{new File(url.toURI())}} > Update forbiddenapis to v2.0 > > > Key: TIKA-1756 > URL: https://issues.apache.org/jira/browse/TIKA-1756 > Project: Tika > Issue Type: Improvement >Reporter: Uwe Schindler > Attachments: TIKA-1756.patch > > > Forbiddenapis 2.0 was released a few hours ago. Apache POI and Lucene already > updated, Tika should do this, too. > Attached is a patch. > {quote} > The main new feature is native support for the Gradle build system (minimum > requirement is Gradle 2.3). But also Apache Ant and Apache Maven build > systems got improved support: Ant can now load signatures from arbitrary > resources by using a new XML element that may > contain any valid ANT resource, e.g., ivy's cache-filesets or plain URLs. > Apache Maven now supports to load signatures files as artifacts from your > repository or Maven Central (new signaturesArtifacts Mojo property). > Breaking changes: > - Update to Java 6 as minimum requirement. > - Switch default Maven lifecycle phase to verify. > Bug fixes: > - Add automatic plugin execution override for M2E. It is no longer needed to > add a lifecycle mapping to exclude forbiddenapis to execute inside Eclipse's > M2E > {quote} > The M2E change is nice, because you no longer need the M2E workaround to > disable running the plugin in Eclipse manually. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name
[ https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1757: Attachment: TIKA-1757.patch Patch for broken test. > tika-batch tests fail on systems with whitespace or special chars in folder > name > > > Key: TIKA-1757 > URL: https://issues.apache.org/jira/browse/TIKA-1757 > Project: Tika > Issue Type: Bug >Reporter: Uwe Schindler >Assignee: Tim Allison > Attachments: TIKA-1757.patch > > > This is one problem that forbiddenapis des not catch, because the method > affected has valid use cases: {{URL#getFile()}} or {{URL#getPath()}} both > return the URL path, which should never be treated as a file system path (for > file: URLs). This is breaks asap, if the path contains special characters > which may not be part of URL. getFile() and getPath() return the encoded path. > The correct way to transform a file URL to a file is: {{new > File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven > community for Mojos/Plugins. > In fact the affected test should not use a file at all. Instead it should use > {{Class#getResourceAsStream()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name
[ https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938917#comment-14938917 ] Uwe Schindler commented on TIKA-1757: - bq. If one needs a java.nio.file.Path, Paths.get(url.toURI()) can be used instead. Of course. But in the affected test using a file just to open an InputStream was wrong anyways. So I fixed it by completely removing any File/Path usage. > tika-batch tests fail on systems with whitespace or special chars in folder > name > > > Key: TIKA-1757 > URL: https://issues.apache.org/jira/browse/TIKA-1757 > Project: Tika > Issue Type: Bug >Reporter: Uwe Schindler >Assignee: Tim Allison > Attachments: TIKA-1757.patch > > > This is one problem that forbiddenapis des not catch, because the method > affected has valid use cases: {{URL#getFile()}} or {{URL#getPath()}} both > return the URL path, which should never be treated as a file system path (for > file: URLs). This is breaks asap, if the path contains special characters > which may not be part of URL. getFile() and getPath() return the encoded path. > The correct way to transform a file URL to a file is: {{new > File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven > community for Mojos/Plugins. > In fact the affected test should not use a file at all. Instead it should use > {{Class#getResourceAsStream()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
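The {{Paths.get(url.toURI())}} variant quoted above is the NIO twin of {{new File(url.toURI())}}; a short sketch with an illustrative path (not the actual test resource):

```java
import java.io.File;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

public class UrlToPathSketch {
    public static void main(String[] args) throws Exception {
        // a file: URL whose parent directory name contains a space
        URL url = new File("/tmp/some dir/data.txt").toURI().toURL();

        // Decode via URI; url.getPath() would keep the "%20" in the path
        Path p = Paths.get(url.toURI());
        System.out.println(p.getFileName());               // data.txt
        System.out.println(p.toString().contains("%20"));  // false: the space survived decoding
    }
}
```

As the comment notes, though, the cleanest fix for a test is to avoid file-system paths entirely and read the resource as a stream.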
[jira] [Commented] (TIKA-1714) Consider making default host for Tika Server 0.0.0.0 instead of localhost
[ https://issues.apache.org/jira/browse/TIKA-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701435#comment-14701435 ] Uwe Schindler commented on TIKA-1714: - If you want to bind to all, don't use 0.0.0.0, because this is IPv4 only (won't work with IPv6). To bind to all, remove the whole IP address setting in the socket config. It then binds to IPv4 and also IPv6, depending on availability. Consider making default host for Tika Server 0.0.0.0 instead of localhost - Key: TIKA-1714 URL: https://issues.apache.org/jira/browse/TIKA-1714 Project: Tika Issue Type: Improvement Components: server Affects Versions: 1.10 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.11 I noticed in Tika-Python on Windows while fixing some bugs that by default Tika Server binds to localhost which means that the Tika Server running on Windows isn't available to external hosts trying to access it on host name:9998. I think the default behavior is that it *should* be available externally, meaning, we should probably bind to the special address, 0.0.0.0, which binds to all interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1714) Consider making default host for Tika Server 0.0.0.0 instead of localhost
[ https://issues.apache.org/jira/browse/TIKA-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701442#comment-14701442 ] Uwe Schindler commented on TIKA-1714: - In any case, I agree with Nick, we should not do this. Maybe allow binding to all addresses using a system property, or make it configurable. I have a lot of machines with multiple IP addresses, and I want external services to bind only to one specific address, so the default should be ::1 / 127.0.0.1 or any IP address the user passes as a command line option / system property (like {{-Djetty.host=XXX -Djetty.port=XXX}}). The user is then also free to bind to IPv4 and/or IPv6 on his own. Consider making default host for Tika Server 0.0.0.0 instead of localhost - Key: TIKA-1714 URL: https://issues.apache.org/jira/browse/TIKA-1714 Project: Tika Issue Type: Improvement Components: server Affects Versions: 1.10 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.11 I noticed in Tika-Python on Windows while fixing some bugs that by default Tika Server binds to localhost which means that the Tika Server running on Windows isn't available to external hosts trying to access it on host name:9998. I think the default behavior is that it *should* be available externally, meaning, we should probably bind to the special address, 0.0.0.0, which binds to all interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
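The addressing semantics under discussion can be sketched with plain java.net (Tika Server's real configuration goes through its HTTP server setup; this only illustrates loopback vs. wildcard binding):

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class BindSketch {
    public static void main(String[] args) throws Exception {
        // Safe default: bind to loopback only (the ::1 / 127.0.0.1 case)
        try (ServerSocket local = new ServerSocket()) {
            local.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            System.out.println(local.getInetAddress().isLoopbackAddress()); // true
        }

        // Bind to all interfaces: give NO address at all. This yields the
        // wildcard address (IPv4 and IPv6 as available), unlike the literal
        // "0.0.0.0", which is IPv4-only.
        try (ServerSocket any = new ServerSocket()) {
            any.bind(new InetSocketAddress(0));
            System.out.println(any.getInetAddress().isAnyLocalAddress());   // true
        }
    }
}
```

Port 0 is used here only so the sketch never collides with a port already in use.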
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698313#comment-14698313 ] Uwe Schindler commented on TIKA-1706: - Yes, you can add the Maven property {{<failOnUnresolvableSignatures>false</failOnUnresolvableSignatures>}} to the plugin configuration: [http://jenkins.thetaphi.de/job/Forbidden-APIs/javadoc/check-mojo.html#failOnUnresolvableSignatures] An alternative is to enable commons-io-unsafe-2.4 only for those modules where it's used; unfortunately this is not so easy, because you cannot inherit only some array values to submodules, you must reconfigure all bundledSignatures in submodules. Bring back commons-io to tika-core -- Key: TIKA-1706 URL: https://issues.apache.org/jira/browse/TIKA-1706 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Attachments: TIKA-1706.patch TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space. I believe these arguments are weaker nowadays due to the following concerns: - Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it - Since some modules use both tika-core and commons-io, it's not clear which code should be used - Having the inlined classes causes more maintenance and/or technology debt (which in turn causes more maintenance) - Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on. I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697961#comment-14697961 ] Uwe Schindler commented on TIKA-1706: - If you bring in commons-io, you should also add the corresponding forbidden-apis signatures to the POM. commons-io makes it easy to choose the wrong IOUtils/FileUtils method, and then you are dependent on the default charset again... https://github.com/policeman-tools/forbidden-apis/wiki/BundledSignatures Bring back commons-io to tika-core -- Key: TIKA-1706 URL: https://issues.apache.org/jira/browse/TIKA-1706 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Attachments: TIKA-1706.patch TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space. I believe these arguments are weaker nowadays due to the following concerns: - Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it - Since some modules use both tika-core and commons-io, it's not clear which code should be used - Having the inlined classes causes more maintenance and/or technology debt (which in turn causes more maintenance) - Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on. I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
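The class of bug meant here is charset-dependent decoding; a stdlib-only sketch (commons-io's charset-less overloads such as {{IOUtils.toString(InputStream)}} have the same trap as the "bad" line below):

```java
import java.nio.charset.StandardCharsets;

public class CharsetSketch {
    public static void main(String[] args) {
        byte[] utf8 = "h\u00e4user".getBytes(StandardCharsets.UTF_8); // "häuser" as UTF-8 bytes

        // BAD: uses the platform default charset, so the result differs between
        // machines. This is the pattern the forbidden-apis signatures flag.
        String platformDependent = new String(utf8);

        // GOOD: charset named explicitly, same result everywhere.
        String portable = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(portable.equals("h\u00e4user"));  // true
    }
}
```

The forbidden-apis bundled signature for commons-io exists precisely to fail the build when one of the charset-less overloads slips in.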
[jira] [Updated] (TIKA-1705) Update ASM dependency to 5.0.4
[ https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1705: Attachment: TIKA-1705-2.patch Sorry for a second patch. I just noticed that you were using asm-debug-all.jar instead of plain simple asm.jar. As this is a very basic parser, the asm-commons parts or helper visitors are not needed, so we should fallback to plain asm (also for compatibility with other projects). The -debug stuff was previously used because of generics warnings in earlier versions (they stripped off generics from JAR file), but this is no longer an issue. So please apply this patch, too :-) Update ASM dependency to 5.0.4 -- Key: TIKA-1705 URL: https://issues.apache.org/jira/browse/TIKA-1705 Project: Tika Issue Type: Task Affects Versions: 1.7 Reporter: Uwe Schindler Assignee: Dave Meikle Fix For: 1.11 Attachments: TIKA-1705-2.patch, TIKA-1705.patch Currently the Class file parser uses ASM 4.1. This older version cannot read Java 8 / Java 9 class files (fails with Exception). The upgrade to ASM 5.0.4 is very simple, just Maven dependency change. The code change is only to update the visitor version, so it gets new Java 8 features like lambdas reported, but this is not really required, but should be done for full support. FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 5, too. You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no problem with Lucene using a newer version). Since ASM 4.x the updates are more easy (no visitor interfaces anymore, instead abstract classes), so it does not break if you just replace the JAR file. So just see this as a recommendatation, not urgent! Solr/Lucene will also work without this patch (it just replaces the shipped ASM by newer version in our packaging). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (TIKA-1705) Update ASM dependency to 5.0.4
[ https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reopened TIKA-1705: - Reopen for 2nd patch. Update ASM dependency to 5.0.4 -- Key: TIKA-1705 URL: https://issues.apache.org/jira/browse/TIKA-1705 Project: Tika Issue Type: Task Affects Versions: 1.7 Reporter: Uwe Schindler Assignee: Dave Meikle Fix For: 1.11 Attachments: TIKA-1705-2.patch, TIKA-1705.patch Currently the Class file parser uses ASM 4.1. This older version cannot read Java 8 / Java 9 class files (fails with Exception). The upgrade to ASM 5.0.4 is very simple, just Maven dependency change. The code change is only to update the visitor version, so it gets new Java 8 features like lambdas reported, but this is not really required, but should be done for full support. FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 5, too. You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no problem with Lucene using a newer version). Since ASM 4.x the updates are more easy (no visitor interfaces anymore, instead abstract classes), so it does not break if you just replace the JAR file. So just see this as a recommendatation, not urgent! Solr/Lucene will also work without this patch (it just replaces the shipped ASM by newer version in our packaging). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1705) Update ASM dependency to 5.0.4
[ https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681759#comment-14681759 ] Uwe Schindler commented on TIKA-1705: - The question about this: This will not fail tests when new versions of JVMs are out. You will only find that problem when new class files are added In my opinion, a good test would also be to also test a class file from the local JVM (e.g., {{String.class.getResourceAsStream('String.class')}} With that test you would actually make sure that the class files of the JVM that compiles can be read! So once Java 9 is out and has a new classfile format, this would fail build if somebody runs build with this JVM. Update ASM dependency to 5.0.4 -- Key: TIKA-1705 URL: https://issues.apache.org/jira/browse/TIKA-1705 Project: Tika Issue Type: Task Affects Versions: 1.7 Reporter: Uwe Schindler Assignee: Dave Meikle Fix For: 1.11 Attachments: TIKA-1705-2.patch, TIKA-1705.patch Currently the Class file parser uses ASM 4.1. This older version cannot read Java 8 / Java 9 class files (fails with Exception). The upgrade to ASM 5.0.4 is very simple, just Maven dependency change. The code change is only to update the visitor version, so it gets new Java 8 features like lambdas reported, but this is not really required, but should be done for full support. FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 5, too. You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no problem with Lucene using a newer version). Since ASM 4.x the updates are more easy (no visitor interfaces anymore, instead abstract classes), so it does not break if you just replace the JAR file. So just see this as a recommendatation, not urgent! Solr/Lucene will also work without this patch (it just replaces the shipped ASM by newer version in our packaging). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
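The proposed self-test could look roughly like this (a sketch; a real Tika test would feed the bytes into the class file parser rather than just checking the magic number):

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class LocalJvmClassFile {
    /** Reads java/lang/String.class from the JVM that is running the build. */
    static byte[] readStringClass() throws Exception {
        try (InputStream in = String.class.getResourceAsStream("String.class");
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] clazz = readStringClass();
        // every class file starts with the magic number 0xCAFEBABE; the parser
        // under test must accept whatever class file version follows it
        System.out.println(clazz[0] == (byte) 0xCA && clazz[1] == (byte) 0xFE);  // true
    }
}
```

Because the bytes come from the running JVM, the test automatically fails as soon as the build runs on a JDK whose class file format the parser cannot read yet, which is exactly the early warning being asked for.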
[jira] [Created] (TIKA-1705) Update ASM dependency to 5.0.4
Uwe Schindler created TIKA-1705: --- Summary: Update ASM dependency to 5.0.4 Key: TIKA-1705 URL: https://issues.apache.org/jira/browse/TIKA-1705 Project: Tika Issue Type: Task Affects Versions: 1.7 Reporter: Uwe Schindler Currently the Class file parser uses ASM 4.1. This older version cannot read Java 8 / Java 9 class files (fails with an Exception). The upgrade to ASM 5.0.4 is very simple, just a Maven dependency change. The code change is only to update the visitor version, so it gets new Java 8 features like lambdas reported; this is not strictly required, but should be done for full support. FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 5, too. You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no problem with Lucene using a newer version). Since ASM 4.x the updates are easier (no visitor interfaces anymore, instead abstract classes), so it does not break if you just replace the JAR file. So just see this as a recommendation, not urgent! Solr/Lucene will also work without this patch (it just replaces the shipped ASM by a newer version in our packaging). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1705) Update ASM dependency to 5.0.4
[ https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1705: Attachment: TIKA-1705.patch Simple patch. All tests pass. Update ASM dependency to 5.0.4 -- Key: TIKA-1705 URL: https://issues.apache.org/jira/browse/TIKA-1705 Project: Tika Issue Type: Task Affects Versions: 1.7 Reporter: Uwe Schindler Attachments: TIKA-1705.patch Currently the Class file parser uses ASM 4.1. This older version cannot read Java 8 / Java 9 class files (fails with Exception). The upgrade to ASM 5.0.4 is very simple, just Maven dependency change. The code change is only to update the visitor version, so it gets new Java 8 features like lambdas reported, but this is not really required, but should be done for full support. FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 5, too. You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no problem with Lucene using a newer version). Since ASM 4.x the updates are more easy (no visitor interfaces anymore, instead abstract classes), so it does not break if you just replace the JAR file. So just see this as a recommendatation, not urgent! Solr/Lucene will also work without this patch (it just replaces the shipped ASM by newer version in our packaging). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1675) please avoid xmlbeans dependency
[ https://issues.apache.org/jira/browse/TIKA-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617578#comment-14617578 ] Uwe Schindler commented on TIKA-1675: - There was already an issue/discussion open on the POI mailing lists and issue tracker to no longer use xmlbeans & Co., because since Java 6 the JAXB interface is a public API that allows mapping XML documents to Java beans, which is exactly what xmlbeans is doing. Unfortunately this is a larger effort: changing the API to use the standard Java API (which might also bring more performance). This would remove a lot of unneeded XML-based stuff from POI for Microsoft Office 2007+ file formats. -1 to absorb the buggy xmlbeans (this lib was also the problem of the major Solr/Lucene security issue last year) +1 to adopt JAXB instead of xmlbeans please avoid xmlbeans dependency Key: TIKA-1675 URL: https://issues.apache.org/jira/browse/TIKA-1675 Project: Tika Issue Type: Bug Reporter: Robert Muir This dependency (e.g. jar file) is fundamentally broken... XMLBEANS-499 Is there an alternative that could be used? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1675) please avoid xmlbeans dependency
[ https://issues.apache.org/jira/browse/TIKA-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617578#comment-14617578 ] Uwe Schindler edited comment on TIKA-1675 at 7/7/15 10:53 PM: -- There was already an issue/discussion open on POI mailing lists and issue tracker to no longer use xmlbeans Co, because since Java 6 the JAXB interface is a public API that allows to map XML documents to Java Beans (https://jcp.org/en/jsr/detail?id=222) - which is exactly the same as xmlbeans is dooing. Unfortunately this is a larger approach to change the API to do use the standards Java API (and might also bring more performance). This would remove a lot of unneeded XML-based stuff from POI for Microsoft Office 2007+ file formats. -1 to absorb the buggy xmlbeans (this lib was also the problem of the major Solr/Lucene security issue last year) +1 to adopt JAXB instead of xmlbeans was (Author: thetaphi): There was already an issue/discussion open on POI mailing lists and issue tracker to no longer use xmlbeans Co, because since Java 6 the JAXB interface is a public API that allows to map XML documents to Java Beans - which is exactly the same as xmlbeans is dooing. Unfortunately this is a larger approach to change the API to do use the standards Java API (and might also bring more performance). This would remove a lot of unneeded XML-based stuff from POI for Microsoft Office 2007+ file formats. -1 to absorb the buggy xmlbeans (this lib was also the problem of the major Solr/Lucene security issue last year) +1 to adopt JAXB instead of xmlbeans please avoid xmlbeans dependency Key: TIKA-1675 URL: https://issues.apache.org/jira/browse/TIKA-1675 Project: Tika Issue Type: Bug Reporter: Robert Muir This dependency (e.g jar file) is fundamentally broken... XMLBEANS-499 Is there an alternative that could be used? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1675) please avoid xmlbeans dependency
[ https://issues.apache.org/jira/browse/TIKA-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617588#comment-14617588 ] Uwe Schindler commented on TIKA-1675: - kiwiwings kiwiwi...@apache.org already proposed this for POI: [http://apache-poi.1045710.n5.nabble.com/Re-svn-commit-r1682117-poi-site-src-documentation-content-xdocs-document-index-xml-td5718914.html#a5718928] But this is really an issue for Apache POI! please avoid xmlbeans dependency Key: TIKA-1675 URL: https://issues.apache.org/jira/browse/TIKA-1675 Project: Tika Issue Type: Bug Reporter: Robert Muir This dependency (e.g jar file) is fundamentally broken... XMLBEANS-499 Is there an alternative that could be used? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
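To make the JAXB argument above concrete, here is a minimal sketch of mapping an XML document to a Java bean, which is the capability the comments say could replace xmlbeans. The {{Doc}} class and element names are invented for illustration; the sketch assumes the {{javax.xml.bind}} API is available (bundled with the JDK from Java 6 through Java 10; on later JDKs it needs a standalone JAXB implementation such as the glassfish artifacts discussed in TIKA-2743).

```java
import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlRootElement;

// Hypothetical bean for illustration; JAXB maps the <doc> root element
// to this class and the <title> child element to the public field.
@XmlRootElement
class Doc {
    public String title;
}

public class JaxbDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>hello</title></doc>";
        // Unmarshal the XML document straight into a Java bean,
        // the same job xmlbeans does for POI's OOXML schemas.
        Doc d = (Doc) JAXBContext.newInstance(Doc.class)
                .createUnmarshaller()
                .unmarshal(new StringReader(xml));
        System.out.println(d.title);
    }
}
```

The default JAXB access type (PUBLIC_MEMBER) picks up the public field without further annotations, which keeps the bean declaration minimal.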
[jira] [Commented] (TIKA-1637) Oracle internal API jdeps request for information
[ https://issues.apache.org/jira/browse/TIKA-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558087#comment-14558087 ] Uwe Schindler commented on TIKA-1637: - Hi Dave, forbidden-apis already forbids use of internal APIs like sun.misc.Unsafe, see TIKA's parent POM: {{<internalRuntimeForbidden>true</internalRuntimeForbidden>}} But indeed, we don't see usage in dependent libraries, so it would be good to run jdeps on all the millions of dependencies! :-) Oracle internal API jdeps request for information - Key: TIKA-1637 URL: https://issues.apache.org/jira/browse/TIKA-1637 Project: Tika Issue Type: Task Reporter: Dave Meikle Assignee: Dave Meikle Priority: Trivial We have been asked to provide information to Oracle around the internal API usage in Apache Tika to support the move to JDK 9, which contains significant changes. {quote} Hi David, My name is Rory O'Donnell, I am the OpenJDK Quality Group Lead. I'm contacting you because your open source project seems to be a very popular dependency for other open source projects. As part of the preparations for JDK 9, Oracle’s engineers have been analyzing open source projects like yours to understand usage. One area of concern involves identifying compatibility problems, such as reliance on JDK-internal APIs. Our engineers have already prepared guidance on migrating some of the more common usage patterns of JDK-internal APIs to supported public interfaces. The list is on the OpenJDK wiki [0]. As part of the ongoing development of JDK 9, I would like to inquire about your usage of JDK-internal APIs and to encourage migration towards supported Java APIs if necessary. The first step is to identify if your application(s) is leveraging internal APIs. Step 1: Download JDeps. Just download a preview release of JDK8 (JDeps Download). You do not need to actually test or run your application on JDK8. 
JDeps (Docs) looks through JAR files and identifies which JAR files use internal APIs and then lists those APIs. Step 2: To run JDeps against an application. The command looks like: jdk8/bin/jdeps -P -jdkinternals *.jar > your-application.jdeps.txt The output inside your-application.jdeps.txt will look like: your.package (Filename.jar) -> com.sun.corba.se JDK internal API (rt.jar) 3rd party library using Internal APIs: If your analysis uncovers a third-party component that you rely on, you can contact the provider and let them know of the upcoming changes. You can then either work with the provider to get an updated library that won't rely on Internal APIs, or you can find an alternative provider for the capabilities that the offending library provides. Dynamic use of Internal APIs: JDeps cannot detect dynamic use of internal APIs, for example through reflection, service loaders and similar mechanisms. Rgds, Rory [0] https://wiki.openjdk.java.net/display/JDK8/Java+Dependency+Analysis+Tool -- Rgds, Rory O'Donnell Quality Engineering Manager Oracle EMEA, Dublin, Ireland {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1637) Oracle internal API jdeps request for information
[ https://issues.apache.org/jira/browse/TIKA-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558087#comment-14558087 ] Uwe Schindler edited comment on TIKA-1637 at 5/25/15 10:18 AM: --- Hi Dave, forbidden-apis already forbids use of internal APIs like sun.misc.Unsafe, see TIKA's parent POM: {{<internalRuntimeForbidden>true</internalRuntimeForbidden>}} It also forbids deprecated APIs: {{<bundledSignature>jdk-deprecated</bundledSignature>}} This is important because for the first time in Java's lifetime, JDK 9 really removed some deprecated stuff!!! (because this was needed for modularization) But indeed, we don't see usage in dependent libraries, so it would be good to run jdeps on all the millions of dependencies! :-) Oracle internal API jdeps request for information - Key: TIKA-1637 URL: https://issues.apache.org/jira/browse/TIKA-1637 Project: Tika Issue Type: Task Reporter: Dave Meikle Assignee: Dave Meikle Priority: Trivial We have been asked to provide information to Oracle around the internal API usage in Apache Tika to support the move to JDK 9, which contains significant changes. {quote} Hi David, My name is Rory O'Donnell, I am the OpenJDK Quality Group Lead. I'm contacting you because your open source project seems to be a very popular dependency for other open source projects. As part of the preparations for JDK 9, Oracle’s engineers have been analyzing open source projects like yours to understand usage. One area of concern involves identifying compatibility problems, such as reliance on JDK-internal APIs. 
Our engineers have already prepared guidance on migrating some of the more common usage patterns of JDK-internal APIs to supported public interfaces. The list is on the OpenJDK wiki [0]. As part of the ongoing development of JDK 9, I would like to inquire about your usage of JDK-internal APIs and to encourage migration towards supported Java APIs if necessary. The first step is to identify if your application(s) is leveraging internal APIs. Step 1: Download JDeps. Just download a preview release of JDK8 (JDeps Download). You do not need to actually test or run your application on JDK8. JDeps (Docs) looks through JAR files and identifies which JAR files use internal APIs and then lists those APIs. Step 2: To run JDeps against an application. The command looks like: jdk8/bin/jdeps -P -jdkinternals *.jar > your-application.jdeps.txt The output inside your-application.jdeps.txt will look like: your.package (Filename.jar) -> com.sun.corba.se JDK internal API (rt.jar) 3rd party library using Internal APIs: If your analysis uncovers a third-party component that you rely on, you can contact the provider and let them know of the upcoming changes. You can then either work with the provider to get an updated library that won't rely on Internal APIs, or you can find an alternative provider for the capabilities that the offending library provides. Dynamic use of Internal APIs: JDeps cannot detect dynamic use of internal APIs, for example through reflection, service loaders and similar mechanisms. Rgds, Rory [0] https://wiki.openjdk.java.net/display/JDK8/Java+Dependency+Analysis+Tool -- Rgds, Rory O'Donnell Quality Engineering Manager Oracle EMEA, Dublin, Ireland {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
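As a small illustration of the Step 2 workflow quoted above, the following sketch assembles the jdeps invocation programmatically. The jar name is a placeholder, and the actual ProcessBuilder call is left commented out because it assumes a JDK 8+ jdeps binary on the PATH.

```java
import java.util.ArrayList;
import java.util.List;

public class JdepsRunner {
    // Build the jdeps command line described in the letter:
    //   jdeps -P -jdkinternals <jars...>  (output redirected to a report file)
    static List<String> buildCommand(List<String> jars) {
        List<String> cmd = new ArrayList<>();
        cmd.add("jdeps");
        cmd.add("-P");            // -P: filter dependences within the same package
        cmd.add("-jdkinternals"); // report only uses of JDK-internal APIs
        cmd.addAll(jars);
        return cmd;
    }

    public static void main(String[] args) {
        // "tika-core.jar" is just an example argument.
        List<String> cmd = buildCommand(List.of("tika-core.jar"));
        // To actually run it and capture the report (requires jdeps on PATH):
        // new ProcessBuilder(cmd)
        //     .redirectOutput(new java.io.File("your-application.jdeps.txt"))
        //     .start().waitFor();
        System.out.println(String.join(" ", cmd));
    }
}
```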
[jira] [Commented] (TIKA-1628) ExternalParser.check should return false if it hits SecurityException
[ https://issues.apache.org/jira/browse/TIKA-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539838#comment-14539838 ] Uwe Schindler commented on TIKA-1628: - +1 to the patch. I don't think we need a test! ExternalParser.check should return false if it hits SecurityException - Key: TIKA-1628 URL: https://issues.apache.org/jira/browse/TIKA-1628 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.9 Attachments: TIKA-1628.patch If you run Tika with a Java security manager that blocks execution of external processes, ExternalParser.check throws SecurityException, but I think it should just return false? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
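A minimal sketch of the behavior the TIKA-1628 patch aims for: treat a SecurityException thrown while forking a process as "external tool unavailable" and return false, instead of letting it propagate. The method name {{checkAvailable}} is illustrative, not the actual Tika API; the real {{ExternalParser.check}} also inspects the process exit value.

```java
public class ExternalCheck {
    // Returns true only if the command can actually be launched.
    static boolean checkAvailable(String... cmd) {
        try {
            Process p = Runtime.getRuntime().exec(cmd);
            p.destroy();
            return true;
        } catch (SecurityException e) {
            // A security manager forbids forking processes:
            // report "unavailable" instead of failing the whole parse.
            return false;
        } catch (java.io.IOException e) {
            // Command not found or not executable.
            return false;
        }
    }

    public static void main(String[] args) {
        // A command that certainly does not exist, so the IOException path fires.
        System.out.println(checkAvailable("definitely-not-a-real-command-12345"));
    }
}
```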
[jira] [Comment Edited] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525112#comment-14525112 ] Uwe Schindler edited comment on TIKA-1582 at 5/2/15 7:35 AM: - Hi Chris, there is already forbidden-apis 1.8 available! The main new feature here is to allow suppressing forbidden checks in classes/methods/fields using annotations ({{@SuppressForbidden}} or similar, configurable). In the past, we excluded whole class files in Lucene/Elasticsearch (e.g. where we want to write to System.out because it's a command line tool, which is otherwise completely forbidden in Lucene), now we can annotate those methods (see LUCENE-6420). If we also need this functionality in TIKA, too - we can update. Bumping the version number in any case is fine, too (e.g., for Java 9 support)! Uwe Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Assignee: Chris A. 
Mattmann Priority: Trivial Labels: memex Fix For: 1.9 Attachments: nnmodel.docx, week2-report-histogram comparison.docx, week6 report.docx Content-based mime type detection is one of the popular approaches to detect mime type; there are others based on file extension and magic numbers. Currently Tika has implemented 3 approaches in detecting mime types. They are: 1) file extensions 2) magic numbers (the most trustworthy in Tika) 3) content-type (the header in the HTTP response if present and available). Content-based mime type detection, however, analyses the distribution of the entire stream of bytes, finds a similar pattern for the same type and builds a function that is able to group them into one or several classes so as to classify and predict. It is believed this feature might broaden the usage of Tika with a bit more security enforcement for mime type detection. Because we want to build a model that is etched with the patterns it has seen, in some situations we may not trust those types which have not been trained/learned by the model. In some situations, magic numbers embedded in the files can be copied but the actual content could be a potentially detrimental Trojan program. By enforcing the trust on byte frequency patterns, we are able to enhance the security of the detection. The proposed content-based mime detection to be integrated into Tika is based on a machine learning algorithm, i.e. a neural network with back-propagation. The input: 0-255 bins, each of which represents a byte and stores the count of occurrences for that byte; the byte frequency histograms are normalized to fall in the range between 0 and 1, then passed to a companding function to enhance the infrequent bytes. The output of the neural network is a binary decision, 1 or 0. Notice BTW, the proposed feature will be implemented with the GRB file type as one example. 
In this example, we build a model that is able to classify the GRB file type from non-GRB file types; notice the set of non-GRB files is huge and cannot be easily defined, so there need to be as many negative training examples as possible to form this non-GRB decision boundary. The neural network is considered as a two-stage process: training and classification. The training can be done in any programming language; in this feature/research, the training of the neural network is implemented in R and the source can be found in my github repository, i.e. https://github.com/LukeLiush/filetypeDetection; I am also going to post a document that describes the use of the program and the syntax/format of the input and output. After training, we need to export the model and import it to Tika; in Tika, we create a TrainedModelDetector that reads this model file with one or more model parameters or several model files, so it can detect the mime types with the
[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram
[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525112#comment-14525112 ] Uwe Schindler commented on TIKA-1582: - Hi Chris, there is already forbidden-apis 1.8 available! The main new feature here is to allow suppressing forbidden checks in classes/methods/fields using annotations ({{@SuppressForbidden}} or similar, configurable). In the past, we excluded whole class files in Lucene/Elasticsearch (e.g. where we want to write to System.out because it's a command line tool, which is otherwise completely forbidden in Lucene), now we can annotate those methods (see LUCENE-6420). If we also need this functionality in TIKA, too - we can update. Bumping the version number in any case is fine, too (e.g., for Java 9 support)! Uwe Mime Detection based on neural networks with Byte-frequency-histogram -- Key: TIKA-1582 URL: https://issues.apache.org/jira/browse/TIKA-1582 Project: Tika Issue Type: Improvement Components: detector, mime Affects Versions: 1.7 Reporter: Luke sh Assignee: Chris A. Mattmann Priority: Trivial Labels: memex Fix For: 1.9 Attachments: nnmodel.docx, week2-report-histogram comparison.docx, week6 report.docx Content-based mime type detection is one of the popular approaches to detect mime type; there are others based on file extension and magic numbers. Currently Tika has implemented 3 approaches in detecting mime types. They are: 1) file extensions 2) magic numbers (the most trustworthy in Tika) 3) content-type (the header in the HTTP response if present and available). Content-based mime type detection, however, analyses the distribution of the entire stream of bytes, finds a similar pattern for the same type and builds a function that is able to group them into one or several classes so as to classify and predict. It is believed this feature might broaden the usage of Tika with a bit more security enforcement for mime type detection. 
Because we want to build a model that is etched with the patterns it has seen, in some situations we may not trust those types which have not been trained/learned by the model. In some situations, magic numbers embedded in the files can be copied but the actual content could be a potentially detrimental Trojan program. By enforcing the trust on byte frequency patterns, we are able to enhance the security of the detection. The proposed content-based mime detection to be integrated into Tika is based on a machine learning algorithm, i.e. a neural network with back-propagation. The input: 0-255 bins, each of which represents a byte and stores the count of occurrences for that byte; the byte frequency histograms are normalized to fall in the range between 0 and 1, then passed to a companding function to enhance the infrequent bytes. The output of the neural network is a binary decision, 1 or 0. Notice BTW, the proposed feature will be implemented with the GRB file type as one example. In this example, we build a model that is able to classify the GRB file type from non-GRB file types; notice the set of non-GRB files is huge and cannot be easily defined, so there need to be as many negative training examples as possible to form this non-GRB decision boundary. The neural network is considered as a two-stage process: training and classification. The training can be done in any programming language; in this feature/research, the training of the neural network is implemented in R and the source can be found in my github repository, i.e. https://github.com/LukeLiush/filetypeDetection; I am also going to post a document that describes the use of the program and the syntax/format of the input and output. 
After training, we need to export the model and import it to Tika; in Tika, we create a TrainedModelDetector that reads this model file with one or more model parameters or several model files, so it can detect the mime types with the model of those mime types. Details of the research and usage of this proposed feature will be posted on my github shortly. It is worth noting again that in this research we only worked out one model - GRB - as one example to demonstrate the use of this content-based mime detection. One of the challenges again is that the non-GRB file types cannot be clearly defined unless we feed our model with some example data for all of the existing file types in the world, but this seems too utopian and rather unlikely, so it is better that the set of classes/types is given and defined in advance to minimize the problem domain. Another challenge is the size of the training data;
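The byte-frequency-histogram input layer described above can be sketched in a few lines: count occurrences of each byte value (256 bins) and normalize into [0, 1]. Dividing by the largest bin is one plausible normalization, chosen here for illustration; the issue also mentions a companding function to boost infrequent bytes, which is omitted.

```java
public class ByteHistogram {
    // Compute a 256-bin byte-frequency histogram, normalized to [0, 1]
    // by dividing each count by the largest count (an assumed normalization).
    static double[] histogram(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xFF]++; // mask to get an unsigned bin index 0..255
        }
        int max = 1; // avoid division by zero on empty input
        for (int c : counts) max = Math.max(max, c);
        double[] normalized = new double[256];
        for (int i = 0; i < 256; i++) {
            normalized[i] = counts[i] / (double) max;
        }
        return normalized;
    }

    public static void main(String[] args) {
        byte[] data = "aab".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        double[] h = histogram(data);
        System.out.println(h['a']); // most frequent byte -> 1.0
        System.out.println(h['b']); // half as frequent -> 0.5
    }
}
```

The resulting 256-element vector is what would be fed to the neural network's input layer.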
[jira] [Commented] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385803#comment-14385803 ] Uwe Schindler commented on TIKA-1511: - Solr uses ANT + IVY to build. We don't use transitive dependencies at all! So whenever updating TIKA, the person who does this prints the dependency tree and then fills all required information into the ivy.xml file and our ivy-versions.properties file :-) In general, we carefully decide which dependencies are really needed. Because TIKA automatically disables parsers which do not load, we have already removed various files (like the netcdf parser - LGPL) or the ASM parser (we don't support indexing Java class files by default). For the current one: We don't want to have native libraries anywhere (we don't even ship our own native libs for WindowsDirectory). Users need to do this themselves and start msvc/gcc. So we would not ship with SQLite support by default. In general it would be good to have some easier plugin mechanism to allow Solr to pick only some parsers they ship by default and those the user can download (e.g. by a script). So it would be good to have multiple parser JARs. So maybe put all crazy parsers that fork processes or call native libs into a separate TIKA parser bundle. The default one should only have pure-Java stuff with as few dependencies as possible... Create a parser for SQLite3 --- Key: TIKA-1511 URL: https://issues.apache.org/jira/browse/TIKA-1511 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.6 Reporter: Luis Filipe Nassif Fix For: 1.8 Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, TIKA-1511v3.patch, TIKA-1511v3bis.patch, testSQLLite3b.db, testSQLLite3b.db I think it would be very useful, as sqlite is used as data storage by a wide range of applications. Opening the ticket to track it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333400#comment-14333400 ] Uwe Schindler commented on TIKA-1558: - Hi, Lucene uses SPI for its index codecs, so we are familiar with SPI. But we have no problems with the order of the classpath. We just preserve what Java delivers in Classloader.getResources(). But order is not really important (it was important for testing in Lucene 4.x, but that's history since last Friday). We already have a custom TikaConfig class so I am happy to use that. In our case we would only put the SPI exclusion into our test classpath. But TikaConfig is also fine. Create a Parser Blacklist - Key: TIKA-1558 URL: https://issues.apache.org/jira/browse/TIKA-1558 Project: Tika Issue Type: New Feature Reporter: Tyler Palsulich Assignee: Tyler Palsulich Fix For: 1.8 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333400#comment-14333400 ] Uwe Schindler edited comment on TIKA-1558 at 2/23/15 4:06 PM: -- Hi, Lucene uses SPI for its index codecs, so we are familiar with SPI. But we have no problems with the order of the classpath. We just preserve what Java delivers in Classloader.getResources(). But order is not really important (it was important for testing in Lucene 4.x, but that's history since last Friday). We already have custom TikaConfig support in the extraction module, so I am happy to use that. In our case we would only put the SPI exclusion into our test classpath. But TikaConfig is also fine. Create a Parser Blacklist - Key: TIKA-1558 URL: https://issues.apache.org/jira/browse/TIKA-1558 Project: Tika Issue Type: New Feature Reporter: Tyler Palsulich Assignee: Tyler Palsulich Fix For: 1.8 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. 
Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
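The blacklist semantics proposed above (and the subclass behavior requested in the TIKA-1557 thread) can be sketched with an assignability check: drop every loaded service whose class is an instance of an entry on the blacklist, so blacklisting a base class also disables all of its subclasses. The parser classes here are stand-ins, not Tika's real ones.

```java
import java.util.List;
import java.util.stream.Collectors;

public class BlacklistDemo {
    // Stand-in hierarchy for illustration only.
    interface Parser {}
    static class ExternalParser implements Parser {}
    static class OcrParser extends ExternalParser {}
    static class TextParser implements Parser {}

    // Keep only parsers that match no blacklist entry; Class.isInstance
    // makes a blacklisted base class also catch its subclasses.
    static List<Parser> filter(List<Parser> loaded, List<Class<?>> blacklist) {
        return loaded.stream()
                .filter(p -> blacklist.stream().noneMatch(b -> b.isInstance(p)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Parser> kept = filter(
                List.of(new OcrParser(), new TextParser()),
                List.of(ExternalParser.class)); // blacklist the base class
        System.out.println(kept.size());        // only TextParser survives
    }
}
```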
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333628#comment-14333628 ] Uwe Schindler commented on TIKA-1526: - Thanks David! ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers Key: TIKA-1526 URL: https://issues.apache.org/jira/browse/TIKA-1526 Project: Tika Issue Type: Wish Reporter: Hoss Man the JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them... https://bugs.openjdk.java.net/browse/JDK-8047340 https://bugs.openjdk.java.net/browse/JDK-8055301 As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled and configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so... {noformat} [junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform. 
[junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
[junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
[junit4] at java.security.AccessController.doPrivileged(Native Method)
[junit4] at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
[junit4] at java.lang.ProcessImpl.start(ProcessImpl.java:130)
[junit4] at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
[junit4] at java.lang.Runtime.exec(Runtime.java:620)
[junit4] at java.lang.Runtime.exec(Runtime.java:485)
[junit4] at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
[junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
[junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
[junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4] at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
[junit4] at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
[junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4] at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
[junit4] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[junit4] at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat} ...unless they go out of their way to white list only the parsers they need/want so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed. It would be nice if Tika's ExternalParser class added a similar hack/workaround to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language... 
{code}
} catch (Error err) {
  if (err.getMessage() != null &&
      (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
    log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
    return "(error executing: " + cmd + ")";
  }
}
{code} ...but with Tika, it might be better for all ExternalParsers to just opt out as if they don't recognize the filetype when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... i'm not really sure how it would best fit into Tika's architecture) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1557) Create TesseractOCR Option to Never Run
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329523#comment-14329523 ] Uwe Schindler commented on TIKA-1557: - I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should also work for subclasses, so one could disable all ForkParser subclasses by adding ForkParser to the blacklist. Create TesseractOCR Option to Never Run --- Key: TIKA-1557 URL: https://issues.apache.org/jira/browse/TIKA-1557 Project: Tika Issue Type: New Feature Components: parser Reporter: Tyler Palsulich Assignee: Tyler Palsulich Priority: Minor Fix For: 1.8 As brought up in TIKA-1555, TesseractOCRParser should have an option to never be run. So, we can add an {{enabled}} option to the Config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1557) Create TesseractOCR Option to Never Run
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329523#comment-14329523 ] Uwe Schindler edited comment on TIKA-1557 at 2/20/15 9:05 PM: -- I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should also work for subclasses, so one could disable all ExternalParser subclasses by adding ExternalParser to blacklist. was (Author: thetaphi): I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should also work for subclasses, so one could disable all ForkParser subclasses by adding ForkParser to blacklist. Create TesseractOCR Option to Never Run --- Key: TIKA-1557 URL: https://issues.apache.org/jira/browse/TIKA-1557 Project: Tika Issue Type: New Feature Components: parser Reporter: Tyler Palsulich Assignee: Tyler Palsulich Priority: Minor Fix For: 1.8 As brought up in TIKA-1555, TesseractOCRParser should have an option to never be run. So, we can add an {{enabled}} option to the Config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1557) Create TesseractOCR Option to Never Run
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329523#comment-14329523 ] Uwe Schindler edited comment on TIKA-1557 at 2/20/15 8:42 PM: -- I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should also work for subclasses, so one could disable all ForkParser subclasses by adding ForkParser to blacklist. was (Author: thetaphi): I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should work alos for subclasses, so one could disable all ForkParser subclasses by adding ForkParser to blacklist. Create TesseractOCR Option to Never Run --- Key: TIKA-1557 URL: https://issues.apache.org/jira/browse/TIKA-1557 Project: Tika Issue Type: New Feature Components: parser Reporter: Tyler Palsulich Assignee: Tyler Palsulich Priority: Minor Fix For: 1.8 As brought up in TIKA-1555, TesseractOCRParser should have an option to never be run. So, we can add an {{enabled}} option to the Config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329276#comment-14329276 ] Uwe Schindler commented on TIKA-1555: - Also, this issue in the JDK is already fixed in Java 7u80 and 8u40 (to be released in the next 2 months): https://bugs.openjdk.java.net/browse/JDK-8047340 posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Environment: MacOS X 10.10.2 Reporter: David Pilato Labels: ocr, parser It can happen on some systems that posix_spawn is not a supported process launch mechanism. We are doing random testing which simulates different kinds of Locale, so I sometimes hit this issue: {code}
java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
	at java.lang.UNIXProcess$1.run(UNIXProcess.java:104)
	at java.lang.UNIXProcess$1.run(UNIXProcess.java:93)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:91)
	at java.lang.ProcessImpl.start(ProcessImpl.java:130)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
	at java.lang.Runtime.exec(Runtime.java:617)
	at java.lang.Runtime.exec(Runtime.java:485)
	at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
	at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
	at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
	at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
	at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
	at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
	at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
	at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.Tika.parseToString(Tika.java:506)
{code} It sounds like it's related to this: http://java.thedizzyheights.com/2014/07/java-error-posix_spawn-is-not-a-supported-process-launch-mechanism-on-this-platform-when-trying-to-spawn-a-process/ Though I have a hard time reproducing it! BTW I wonder if we could add a setting which can return {{false}} for {{TesseractOCRParser#hasTesseract}} even if we have tesseract available. For example, let's say my machine is shared by multiple applications and for one of them I don't want any OCR on my documents. Hope this helps. Let me know if you need more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329282#comment-14329282 ] Uwe Schindler commented on TIKA-1555: - @UweSays: https://twitter.com/UweSays/status/501425093613207552 posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329272#comment-14329272 ] Uwe Schindler commented on TIKA-1555: - This is a duplicate of TIKA-1526. posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329350#comment-14329350 ] Uwe Schindler commented on TIKA-1526: - I was not able to test this, because I have no MacOSX computer and FreeBSD is only a Jenkins server. Maybe [~dadoonet] can try the same with the elasticsearch-mapper-attachments module. ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers Key: TIKA-1526 URL: https://issues.apache.org/jira/browse/TIKA-1526 Project: Tika Issue Type: Wish Reporter: Hoss Man the JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them... https://bugs.openjdk.java.net/browse/JDK-8047340 https://bugs.openjdk.java.net/browse/JDK-8055301 As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled & configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so... {noformat}
[junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
[junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
[junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
[junit4]   at java.security.AccessController.doPrivileged(Native Method)
[junit4]   at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
[junit4]   at java.lang.ProcessImpl.start(ProcessImpl.java:130)
[junit4]   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
[junit4]   at java.lang.Runtime.exec(Runtime.java:620)
[junit4]   at java.lang.Runtime.exec(Runtime.java:485)
[junit4]   at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
[junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
[junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
[junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]   at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
[junit4]   at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
[junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]   at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
[junit4]   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[junit4]   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat} ...unless they go out of their way to whitelist only the parsers they need/want so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed. It would be nice if Tika's ExternalParser class added a similar hack/workaround to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language... {code}
} catch (Error err) {
  if (err.getMessage() != null &&
      (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
    log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
    return "(error executing: " + cmd + ")";
  }
}
{code} ...but with Tika, it might be better for all ExternalParsers to just opt out as if they don't recognize the filetype when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... I'm not really sure how it would best fit into Tika's architecture) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
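The opt-out Hoss sketches for ExternalParsers can be illustrated outside Solr/Tika. Below is a minimal, self-contained sketch (the `forkExternalTool`/`externalToolAvailable` names are hypothetical, not real Tika API): the forking step is simulated by throwing the exact `Error` the JDK bug produces, and the caller converts it into "tool not available" instead of crashing.

```java
public class PosixSpawnTrap {
    // Stand-in for something like ExternalParser.check(): here it always fails
    // the way the JDK locale bug does, so the trap below has something to catch.
    static void forkExternalTool() {
        throw new Error("posix_spawn is not a supported process launch mechanism on this platform.");
    }

    // SOLR-6387-style workaround: treat the locale bug as "external tool not
    // available" rather than letting the Error propagate to the caller.
    static boolean externalToolAvailable() {
        try {
            forkExternalTool();
            return true;
        } catch (Error err) {
            String msg = err.getMessage();
            if (msg != null && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"))) {
                return false; // quietly opt out, as suggested for ExternalParsers
            }
            throw err; // anything else is a real error and must not be swallowed
        }
    }

    public static void main(String[] args) {
        System.out.println(externalToolAvailable()); // false
    }
}
```

With this shape, `getSupportedTypes()` could return an empty set when the check trips the bug, so AutoDetectParser would simply skip the parser.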
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329344#comment-14329344 ] Uwe Schindler commented on TIKA-1555: - Hi David, can you try to compile Tika from a current trunk checkout and test it with ES? If this fixes the issue with the Turkish locale, could you report on TIKA-1526? For me it's hard to reproduce with Windows or Linux. I just analyzed the issue, reported the bug to Oracle and fixed Solr 5.0, but I did no thorough testing on the Tika issue. posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329364#comment-14329364 ] Uwe Schindler commented on TIKA-1555: - bq. BTW I wonder if we could add a setting which can return false for TesseractOCRParser#hasTesseract even if we have tesseract available. You can remove / add custom parsers through the TikaConfig. But I agree, it's hard to maintain, because you have to provide a static list. I would really like to have a separate TikaConfig option to explicitly disable some parsers, so I can use the default SPI lookup but blacklist parsers. We would like to do the same in Solr, too. posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329474#comment-14329474 ] Uwe Schindler commented on TIKA-1555: - bq. You can also disable OCR by setting the Tesseract path to "" in the TesseractOCRConfig. This did not work. If this would disable the fork I would be happy. But it just disables the parser as a side effect, because it tries to fork an invalid process path which is created from the empty string and some suffix. posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289125#comment-14289125 ] Uwe Schindler commented on TIKA-1526: - [~grossws]: This bug is not in Maven itself; the problem here is an unsolved bug in the JDK itself. Maven is perfectly fine, but because of the JDK bug, Maven cannot spawn external processes. ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers Key: TIKA-1526 URL: https://issues.apache.org/jira/browse/TIKA-1526 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288963#comment-14288963 ] Uwe Schindler commented on TIKA-1526: - I tried it with Maven, but this is all too funny. This bug also affects Maven... {noformat}
[uschindler@lucene ~]$ export MAVEN_OPTS=-Duser.language=tr
[uschindler@lucene ~]$ mvn
---
constituent[0]: file:/usr/local/share/java/maven3/lib/aether-connector-wagon-1.13.1.jar
constituent[1]: file:/usr/local/share/java/maven3/lib/maven-repository-metadata-3.0.4.jar
constituent[2]: file:/usr/local/share/java/maven3/lib/plexus-sec-dispatcher-1.3.jar
constituent[3]: file:/usr/local/share/java/maven3/lib/aether-spi-1.13.1.jar
constituent[4]: file:/usr/local/share/java/maven3/lib/maven-compat-3.0.4.jar
constituent[5]: file:/usr/local/share/java/maven3/lib/plexus-component-annotations-1.5.5.jar
constituent[6]: file:/usr/local/share/java/maven3/lib/plexus-cipher-1.7.jar
constituent[7]: file:/usr/local/share/java/maven3/lib/sisu-guava-0.9.9.jar
constituent[8]: file:/usr/local/share/java/maven3/lib/maven-core-3.0.4.jar
constituent[9]: file:/usr/local/share/java/maven3/lib/plexus-utils-2.0.6.jar
constituent[10]: file:/usr/local/share/java/maven3/lib/wagon-provider-api-2.2.jar
constituent[11]: file:/usr/local/share/java/maven3/lib/maven-plugin-api-3.0.4.jar
constituent[12]: file:/usr/local/share/java/maven3/lib/maven-model-builder-3.0.4.jar
constituent[13]: file:/usr/local/share/java/maven3/lib/maven-settings-3.0.4.jar
constituent[14]: file:/usr/local/share/java/maven3/lib/sisu-inject-bean-2.3.0.jar
constituent[15]: file:/usr/local/share/java/maven3/lib/wagon-http-2.2-shaded.jar
constituent[16]: file:/usr/local/share/java/maven3/lib/maven-aether-provider-3.0.4.jar
constituent[17]: file:/usr/local/share/java/maven3/lib/sisu-inject-plexus-2.3.0.jar
constituent[18]: file:/usr/local/share/java/maven3/lib/maven-artifact-3.0.4.jar
constituent[19]: file:/usr/local/share/java/maven3/lib/maven-model-3.0.4.jar
constituent[20]: file:/usr/local/share/java/maven3/lib/wagon-file-2.2.jar
constituent[21]: file:/usr/local/share/java/maven3/lib/maven-embedder-3.0.4.jar
constituent[22]: file:/usr/local/share/java/maven3/lib/sisu-guice-3.1.0-no_aop.jar
constituent[23]: file:/usr/local/share/java/maven3/lib/maven-settings-builder-3.0.4.jar
constituent[24]: file:/usr/local/share/java/maven3/lib/plexus-interpolation-1.14.jar
constituent[25]: file:/usr/local/share/java/maven3/lib/aether-impl-1.13.1.jar
constituent[26]: file:/usr/local/share/java/maven3/lib/aether-api-1.13.1.jar
constituent[27]: file:/usr/local/share/java/maven3/lib/aether-util-1.13.1.jar
constituent[28]: file:/usr/local/share/java/maven3/lib/commons-cli-1.2.jar
---
Exception in thread "main" java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
	at java.lang.UNIXProcess$1.run(UNIXProcess.java:111)
	at java.lang.UNIXProcess$1.run(UNIXProcess.java:93)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:91)
	at java.lang.ProcessImpl.start(ProcessImpl.java:130)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
	at java.lang.Runtime.exec(Runtime.java:617)
	at java.lang.Runtime.exec(Runtime.java:450)
	at java.lang.Runtime.exec(Runtime.java:347)
	at org.codehaus.plexus.interpolation.os.OperatingSystemUtils.getSystemEnvVars(OperatingSystemUtils.java:86)
	at org.codehaus.plexus.interpolation.EnvarBasedValueSource.getEnvars(EnvarBasedValueSource.java:74)
	at org.codehaus.plexus.interpolation.EnvarBasedValueSource.<init>(EnvarBasedValueSource.java:64)
	at org.codehaus.plexus.interpolation.EnvarBasedValueSource.<init>(EnvarBasedValueSource.java:50)
	at org.apache.maven.settings.building.DefaultSettingsBuilder.interpolate(DefaultSettingsBuilder.java:222)
	at org.apache.maven.settings.building.DefaultSettingsBuilder.build(DefaultSettingsBuilder.java:101)
	at org.apache.maven.cli.MavenCli.settings(MavenCli.java:725)
	at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:193)
	at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
	at
[jira] [Commented] (TIKA-1529) Turn forbidden-apis back on
[ https://issues.apache.org/jira/browse/TIKA-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289438#comment-14289438 ] Uwe Schindler commented on TIKA-1529: - If you just check for ASCII chars in some string of unknown encoding, the easiest is to use US-ASCII as the charset; this will always work, also with UTF-8 :-) Turn forbidden-apis back on --- Key: TIKA-1529 URL: https://issues.apache.org/jira/browse/TIKA-1529 Project: Tika Issue Type: Bug Reporter: Tim Allison Priority: Minor [~thetaphi] recently noticed that forbidden-apis was turned off in r1624185, and he submitted a patch to the dev list. Let's turn it back on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
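Uwe's US-ASCII point can be shown with a few lines of plain JDK code (class and method names here are illustrative only): a `CharsetEncoder` for US-ASCII accepts exactly the 7-bit range, so it gives a locale-independent ASCII check that never misfires on UTF-8 input.

```java
import java.nio.charset.StandardCharsets;

public class AsciiCheck {
    // True iff every character of s is plain 7-bit ASCII; US-ASCII is a
    // strict subset of UTF-8, so this check is safe for any encoding guess.
    static boolean isAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(isAscii("hello"));         // true
        System.out.println(isAscii("h\u00e9llo"));    // false: é is not ASCII
    }
}
```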
[jira] [Comment Edited] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289182#comment-14289182 ] Uwe Schindler edited comment on TIKA-1526 at 1/23/15 12:32 PM: --- To work around this bug you can in fact do this. It is just bad to change the user's default locale, which may especially break multi-threaded applications. One solution could be: During startup of the JVM (in the Plexus launcher's main method) you can do the following: - check for the locale; we do this like that: {{new Locale("tr").getLanguage().equals(Locale.getDefault().getLanguage())}} (it is important to do the check like this, because otherwise it's not guaranteed that it really works, especially in newer Java versions!!!) - if it's such a locale, switch to Locale.ROOT (save the original) in a single-threaded environment (this is why it should be in the main launcher) - execute a fake UNIX command, like /bin/true. You can also execute some non-existing command that just fails. The call is just there to statically initialize the broken UNIXProcess class. Once it is initialized correctly it works - switch back to the saved locale was (Author: thetaphi): To work around this bug you can in fact do this. It is just bad to change the user's default locale, which may especially break multi-threaded applications. One solution could be: During startup of the JVM (in the Plexus launcher's main method) you can do the following: - check for the locale; we do this like that: {{new Locale("tr").getLanguage().equals(Locale.getDefault().getLanguage())}} (it is important to do the check like this, because otherwise it's not guaranteed that it really works, especially in newer Java versions!!!) - if it's such a locale, switch to Locale.ROOT (save the original) in a single-threaded environment (this is why it should be in the main launcher) - execute a fake UNIX command, like /bin/true. You can also execute nothing; it is just there to statically initialize the broken UNIXProcess class.
Once it is initialized correctly it works - switch back to saved locale ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers Key: TIKA-1526 URL: https://issues.apache.org/jira/browse/TIKA-1526 Project: Tika Issue Type: Wish Reporter: Hoss Man the JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them... https://bugs.openjdk.java.net/browse/JDK-8047340 https://bugs.openjdk.java.net/browse/JDK-8055301 As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so... {noformat} [junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform. [junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:105) [junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:94) [junit4] at java.security.AccessController.doPrivileged(Native Method) [junit4] at java.lang.UNIXProcess.clinit(UNIXProcess.java:92) [junit4] at java.lang.ProcessImpl.start(ProcessImpl.java:130) [junit4] at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) [junit4] at java.lang.Runtime.exec(Runtime.java:620) [junit4] at java.lang.Runtime.exec(Runtime.java:485) [junit4] at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344) [junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117) [junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90) [junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81) [junit4] at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95) [junit4] at 
org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229) [junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81) [junit4] at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209) [junit4] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) [junit4] at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) {noformat} ...unless they go out of their way to white list only the parsers
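The startup steps above could be sketched roughly as follows. This is a minimal sketch, not Tika or Plexus code: the class and the {{isTurkishLocale}} helper are illustrative names, and /bin/true stands in for any command whose only purpose is to trigger static initialization of UNIXProcess.

```java
import java.util.Locale;

public class TurkishLocaleWorkaround {

    // Helper for the locale check described above. Comparing via
    // getLanguage() on a freshly constructed Locale is deliberate:
    // newer Java versions may canonicalize language tags, so comparing
    // raw locale strings is not guaranteed to work.
    static boolean isTurkishLocale(Locale def) {
        return new Locale("tr").getLanguage().equals(def.getLanguage());
    }

    public static void main(String[] args) {
        Locale saved = Locale.getDefault();
        if (isTurkishLocale(saved)) {
            Locale.setDefault(Locale.ROOT);  // switch while still single-threaded
            try {
                // Any exec triggers static initialization of UNIXProcess;
                // the command itself may fail or not even exist.
                Runtime.getRuntime().exec("/bin/true").waitFor();
            } catch (Exception ignored) {
                // Only the class initialization matters, not the result.
            } finally {
                Locale.setDefault(saved);  // restore the user's locale
            }
        }
    }
}
```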
[jira] [Comment Edited] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14288444#comment-14288444 ] Uwe Schindler edited comment on TIKA-1526 at 1/22/15 11:29 PM: --- Hi Tylor: The problem is explained above. To reproduce it you have to be careful: the original error happens *exactly once*. All later attempts to use the same JVM will cause a NoClassDefFoundError on the UNIXProcess class. Unfortunately I am very tired at the moment; it is past midnight. The main problem is that all other ExternalParser tests will/may fail afterwards in the same JVM if the Turkish locale is used. The commit will fix the issue we see in Solr, but the original issue may still survive if you really try to use ExternalParser in other tests. For which other parsers is it currently used? Only Tesseract, or others too? In Solr we have the problem because the TesseractParser fails during initialization (determining which MIME types it is responsible for), and that is the fatal part. I have no idea about other parsers; if they just fail while parsing, I don't care. The big problem is that the Tesseract parser fails in the Turkish locale and blocks other parsers from executing, because the call to getSupportedTypes() fails [and that is the horrible thing in this bug]. So basically, to reproduce: choose exactly one test you know fails and try it with and without the patch. Don't run other tests that may spawn processes in the same JVM.
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287820#comment-14287820 ] Uwe Schindler commented on TIKA-1526: - FYI: The underlying bug in the JVM will never be fixed in Java 6. Java 9 previews are no longer affected, but Java 7 and Java 8 are still broken (including the update from yesterday). Oracle will possibly fix it in 7u80 (the last Java 7 release before EOL) and 8u40.

ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers

Key: TIKA-1526
URL: https://issues.apache.org/jira/browse/TIKA-1526
Project: Tika
Issue Type: Wish
Reporter: Hoss Man

The JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them...
https://bugs.openjdk.java.net/browse/JDK-8047340
https://bugs.openjdk.java.net/browse/JDK-8055301
As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled and configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so...
{noformat}
[junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
[junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
[junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
[junit4]   at java.security.AccessController.doPrivileged(Native Method)
[junit4]   at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
[junit4]   at java.lang.ProcessImpl.start(ProcessImpl.java:130)
[junit4]   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
[junit4]   at java.lang.Runtime.exec(Runtime.java:620)
[junit4]   at java.lang.Runtime.exec(Runtime.java:485)
[junit4]   at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
[junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
[junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
[junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]   at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
[junit4]   at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
[junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4]   at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
[junit4]   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[junit4]   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}
...unless they go out of their way to white-list only the parsers they need/want so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed. It would be nice if Tika's ExternalParser class added a similar hack/workaround to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language...
{code}
} catch (Error err) {
  if (err.getMessage() != null
      && (err.getMessage().contains("posix_spawn")
          || err.getMessage().contains("UNIXProcess"))) {
    log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
    return "(error executing: " + cmd + ")";
  }
}
{code}
...but with Tika, it might be better for all ExternalParsers to just opt out as if they don't recognize the filetype when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... i'm not really sure how it would best fit into Tika's architecture) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
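A guard of the kind the description proposes could look roughly like the sketch below. This is illustrative only: {{safeCheck}} and {{looksLikeLocaleBug}} are hypothetical names, not Tika's actual ExternalParser API, and the message heuristic is the one from the SOLR-6387 snippet above.

```java
public class ExternalCheckGuard {

    // Heuristic from the SOLR-6387 snippet: the locale-triggered Error
    // mentions posix_spawn or UNIXProcess in its message.
    static boolean looksLikeLocaleBug(Error err) {
        String msg = err.getMessage();
        return msg != null
                && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    // Hypothetical availability check: spawn the command, and treat the
    // JVM locale bug as "command not available" instead of failing fast,
    // so the parser simply opts out of getSupportedTypes().
    static boolean safeCheck(String... cmd) {
        try {
            Process p = new ProcessBuilder(cmd).start();
            p.waitFor();
            return true;
        } catch (Error err) {
            if (looksLikeLocaleBug(err)) {
                return false;  // opt out quietly
            }
            throw err;         // unrelated Errors still propagate
        } catch (Exception e) {
            return false;      // command missing or not executable
        }
    }
}
```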
[jira] [Comment Edited] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287824#comment-14287824 ] Uwe Schindler edited comment on TIKA-1526 at 1/22/15 5:36 PM: -- Tim: Linux does not use posix_spawn; you need MacOSX or Solaris. Oracle has a completely different implementation for spawning processes on Linux.
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287850#comment-14287850 ] Uwe Schindler commented on TIKA-1526: - There is also a second problem: the bug is in the static {} initializer of the UNIXProcess class, so it only strikes when the class is loaded for the first time. If it was loaded correctly, the class is initialized with the right settings and passes (in fact, even with the Turkish locale). But if it fails the first time, UNIXProcess is broken for the whole lifetime of the JVM (even with good locales): because UNIXProcess failed to initialize, the JVM marks it as broken and you get a NoClassDefFoundError. The problem does not happen on Linux, because there the default value of the problematic system property is initialized with some other value that does not contain an "i", the letter affected by the famous upper/lowercasing bug: http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html So if some other test executes an external process first under a non-Turkish locale, later calls with the Turkish locale also succeed. Because of that we test Lucene with all possible locales set before the JVM starts; we don't switch the locale actively during tests.
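The JVM behavior just described (fail once, then NoClassDefFoundError for the rest of the JVM's lifetime) can be demonstrated in isolation. The {{Fragile}} class below is purely illustrative and stands in for UNIXProcess; its initializer always throws, simulating the locale-dependent failure.

```java
public class StaticInitFailureDemo {

    // Stand-in for UNIXProcess: its static initializer always throws.
    static class Fragile {
        static {
            if (true) {  // constant guard: an unconditional throw would not compile
                throw new RuntimeException("simulated posix_spawn init failure");
            }
        }
        static String ping() { return "ok"; }
    }

    // Returns the name of the error raised by touching the class.
    static String use() {
        try {
            return Fragile.ping();
        } catch (ExceptionInInitializerError e) {
            return "ExceptionInInitializerError";  // first use: initializer throws
        } catch (NoClassDefFoundError e) {
            return "NoClassDefFoundError";         // later uses: class marked erroneous
        }
    }

    public static void main(String[] args) {
        System.out.println("first use:  " + use());
        System.out.println("second use: " + use());
    }
}
```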
[jira] [Commented] (TIKA-1435) Update rome dependency to 1.5
[ https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283723#comment-14283723 ] Uwe Schindler commented on TIKA-1435: - Indeed this confused me while doing the Apache Solr update (SOLR-6991). Apache Lucene/Solr does not allow transitive dependencies, so everything is declared explicitly using Ivy. This caused some headache while doing mvn dependency:list and checking all of them manually. Update rome dependency to 1.5 - Key: TIKA-1435 URL: https://issues.apache.org/jira/browse/TIKA-1435 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Johannes Mockenhaupt Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.8 Attachments: netcdf-deps-changes.diff Rome 1.5 has been released to Sonatype (https://github.com/rometools/rome/issues/183). Though the website (http://rometools.github.io/rome/) is blissfully ignorant of that. The update is mostly maintenance, adopting slf4j and generics as well as moving the namespace from _com.sun.syndication_ to _com.rometools_. PR upcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283148#comment-14283148 ] Uwe Schindler commented on TIKA-1523: - Hi, I did some research: This is a bug in Word 2000 (aka Word 9.0) fixed in later versions. Indeed the page count is initially wrong on saving if you don't scroll to the end. People were complaining about that at the time, too, because it sometimes caused the total page number in footnotes to be incorrect as well. http://support.microsoft.com/kb/212653/en-us See also: http://www.ms-office-forum.net/forum/archive/index.php?t-125861.html (German only, 1st comment, translated): {quote} SSD 26.04.2004, 21:07 I take the page count from the properties of a Word file into Access. Now I have the problem that while the file is open in Word, the properties show the correct page count. But when the file is closed and I check the properties in the Open dialog, the page count (it always shows 1 page at first) is only correct after saving the file several times. What could be the cause, and how can I change this? {quote} And: https://groups.google.com/forum/#!topic/microsoft.public.word.vba.general/daf-sUpPlgs You see, initially the page count is wrong. If you open a file with Word 2000 / 9.0 and save it without waiting until the full count was calculated (computers were slower at that time), it saved 1. :-) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283116#comment-14283116 ] Uwe Schindler edited comment on TIKA-1523 at 1/19/15 10:50 PM: --- Yes. It extracts just the metadata via the COM interface for the Windows quick-view component (you don't even need Word installed for that). So I think this is an issue with this old version of Word. In fact when you open the file in Word, it of course shows the real pages and it also recalculates the count, but initially it also shows 1. But here, the metadata as saved in the file is simply 1 or maybe nothing (see below). POI does not reflow the layout to calculate that information. This is why the metadata is only updated by the word processing program on opening and editing the file. If you instruct Word 2010 to open the file read-only (which it does because it's downloaded from the Internet), it shows in the page column. See 2nd screenshot. So clearly a bug in this file, not TIKA's or POI's issue. was (Author: thetaphi): Yes. It extracts just the metadata. So I think this is an issue with this old version of Word. In fact when you open the file in Word, it of course shows the real pages and it also recalculates the count, but initially it also shows 1. But here, the metadata as saved in the file is simply 1 or maybe nothing (see below). POI does not reflow the layout to calculate that information. This is why the metadata is only updated by the word processing program on opening and editing the file. If you instruct Word 2010 to open the file read-only (which it does because it's downloaded from the Internet), it shows in the page column. See 2nd screenshot. So clearly a bug in this file, not TIKA's or POI's issue. 
metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1523: Attachment: screenshot-2.png metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1523: Attachment: (was: screenshot-2.png) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1523: Attachment: screenshot-2.png metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283116#comment-14283116 ] Uwe Schindler commented on TIKA-1523: - Yes. It extracts just the metadata. So I think this is an issue with this old version of Word. In fact when you open the file in Word, it of course shows the real pages and it also recalculates the count, but initially it also shows 1. But here, the metadata as saved in the file is simply 1 or maybe nothing (see below). POI does not reflow the layout to calculate that information. This is why the metadata is only updated by the word processing program on opening and editing the file. If you instruct Word 2010 to open the file read-only (which it does because it's downloaded from the Internet), it shows in the page column. See 2nd screenshot. So clearly a bug in this file, not TIKA's or POI's issue. metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283148#comment-14283148 ] Uwe Schindler edited comment on TIKA-1523 at 1/19/15 11:16 PM: --- Hi, I did some research: This is a bug in Word 2000 (aka Word 9.0) fixed in later versions. Indeed the page count is initially wrong on saving if you don't scroll to the end. People were complaining about that at the time, too, because it sometimes caused the total page number in footnotes to be incorrect as well. http://support.microsoft.com/kb/212653/en-us See also: http://www.ms-office-forum.net/forum/archive/index.php?t-125861.html (German only, 1st comment, translated): {quote} SSD 26.04.2004, 21:07 I take the page count from the properties of a Word file into Access. Now I have the problem that while the file is open in Word, the properties show the correct page count. But when the file is closed and I check the properties in the Open dialog, the page count (it always shows 1 page at first) is only correct after saving the file several times. What could be the cause, and how can I change this? {quote} And: https://groups.google.com/forum/#!topic/microsoft.public.word.vba.general/daf-sUpPlgs {quote} Anyone can help me with this? If I take out Sleep 1, myDoc.BuiltinDocumentProperties(wdPropertyPages) doesnt return the correct number of pages sometimes. For example, if a document has 200 pages, it may come out to return 140, or sometimes 199, instead of 200. To me, it seems it takes some time for MS word to think and get the number of pages. After i put Sleep 1, 99% I got the correct number of pages. However, this will take very long time to process as I need to read 200 to 300 files and the number of pages from each files. Please let me know if there is another better solution for this. {quote} You see, initially the page count is wrong. 
If you open a file with Word 2000 / 9.0 and save it without waiting until the full count was calculated (computers were slower at that time), it saved 1. :-) was (Author: thetaphi): Hi, I did some research: This is a bug in Word 2000 (aka Word 9.0) fixed in later versions. Indeed the page count is initially wrong on saving if you don't scroll to the end. People were complaining about that at the time, too, because it sometimes caused the total page number in footnotes to be incorrect as well. http://support.microsoft.com/kb/212653/en-us See also: http://www.ms-office-forum.net/forum/archive/index.php?t-125861.html (German only, 1st comment, translated): {quote} SSD 26.04.2004, 21:07 I take the page count from the properties of a Word file into Access. Now I have the problem that while the file is open in Word, the properties show the correct page count. But when the file is closed and I check the properties in the Open dialog, the page count (it always shows 1 page at first) is only correct after saving the file several times. What could be the cause, and how can I change this? {quote} And: https://groups.google.com/forum/#!topic/microsoft.public.word.vba.general/daf-sUpPlgs You see, initially the page count is wrong. If you open a file with Word 2000 / 9.0 and save it without waiting until the full count was calculated (computers were slower at that time), it saved 1. :-) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1523: Attachment: screenshot-1.png metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
[ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283092#comment-14283092 ] Uwe Schindler commented on TIKA-1523: - If I save the file with Office 2010, the page number is updated and shows correctly in right-click/Properties. TIKA also shows it. metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0 -- Key: TIKA-1523 URL: https://issues.apache.org/jira/browse/TIKA-1523 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.7 Environment: Ubuntu Reporter: Yamileydis Veranes Assignee: Konstantin Gribov Attachments: Sigmund Freud.doc, screenshot-1.png When I extract the metadata from a Microsoft Word 9.0 document which has 10 pages extractor gives me the result that only has 1 page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1457) NullPointerException in tika-app, parsing PDF content
[ https://issues.apache.org/jira/browse/TIKA-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186533#comment-14186533 ] Uwe Schindler commented on TIKA-1457: - Hi, the next version of Solr with TIKA 1.6 will be Solr 5.0, there will be no more 4.x releases (except bugfix/security). If TIKA 1.7 comes out in the meantime, we will update. About replacing TIKA in a given Solr installation: Yes this may work in most cases. For the change TIKA 1.5 - TIKA 1.6 in current Lucene/Solr 5.x branch, I only changed the dependencies - code changes in the main source code were not needed (the API of TIKA itself is quite stable). I only had to fix one test because of an additional new header X-Parsed-By, which made the test fail. NullPointerException in tika-app, parsing PDF content - Key: TIKA-1457 URL: https://issues.apache.org/jira/browse/TIKA-1457 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Environment: OS - Linux Centos 6.5 Web APP - Tomcat6 Using Solr 4.10 Tika Jar * tika-core-1.5.jar * tika-parsers-1.5.jar * tika-xmp-1.5.jar * pdfbox-1.8.4.jar Reporter: Tadeu Alves Labels: bug, parser, solr, tika,text-extraction Fix For: 1.6 When I try to extract text from some pdf files with the tika app 1.5 null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@52cfcf01 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777) at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@52cfcf01 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219) ... 
19 more Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 0 at java.lang.String.charAt(String.java:658) at org.apache.pdfbox.util.DateConverter.parseDate(DateConverter.java:680) at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:808) at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:780) at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:754) at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:797) at org.apache.pdfbox.pdmodel.PDDocumentInformation.getModificationDate(PDDocumentInformation.java:232) at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:176) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:142) at
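The root cause in the trace above is DateConverter calling charAt(0) on an empty date string. A defensive wrapper (a hypothetical helper for illustration, not PDFBox's actual code) would reject null or empty input before indexing into it:

```java
import java.util.Calendar;
import java.util.TimeZone;

public class SafeDate {

    /** Hypothetical defensive parse: the crash above is charAt(0) on an
     *  empty string, so bail out on null/empty input instead of throwing
     *  StringIndexOutOfBoundsException. */
    public static Calendar parsePdfDate(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        // PDF dates look like "D:YYYYMMDDHHmmSS..."; this sketch extracts
        // only the year field (a full implementation would validate digits
        // and parse the remaining fields and time zone offset).
        String s = raw.startsWith("D:") ? raw.substring(2) : raw;
        if (s.length() < 4) {
            return null;
        }
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(Calendar.YEAR, Integer.parseInt(s.substring(0, 4)));
        return cal;
    }
}
```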
[jira] [Comment Edited] (TIKA-1457) NullPointerException in tika-app, parsing PDF content
[ https://issues.apache.org/jira/browse/TIKA-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186533#comment-14186533 ] Uwe Schindler edited comment on TIKA-1457 at 10/28/14 7:50 AM: --- Hi, the next version of Solr with TIKA 1.6 will be Solr 5.0, there will be no more 4.x releases (except bugfix/security). If TIKA 1.7 comes out in the meantime, we will update. About replacing TIKA in a given Solr installation: Yes this may work in most cases. For the change TIKA 1.5 - TIKA 1.6 in current Lucene/Solr 5.x branch, I only changed the dependencies - code changes in the main source code were not needed (the API of TIKA itself is quite stable). I only had to fix one test because of an additional new header X-Parsed-By, which made the test fail. Be sure to exchange *all* JAR files (not only TIKA, also its deps) in contrib/extraction/lib!!! was (Author: thetaphi): Hi, the next version of Solr with TIKA 1.6 will be Solr 5.0, there will be no more 4.x releases (except bugfix/security). If TIKA 1.7 comes out in the meantime, we will update. About replacing TIKA in a given Solr installation: Yes this may work in most cases. For the change TIKA 1.5 - TIKA 1.6 in current Lucene/Solr 5.x branch, I only changed the dependencies - code changes in the main source code were not needed (the API of TIKA itself is quite stable). I only had to fix one test because of an additional new header X-Parsed-By, which made the test fail. 
NullPointerException in tika-app, parsing PDF content - Key: TIKA-1457 URL: https://issues.apache.org/jira/browse/TIKA-1457 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Environment: OS - Linux Centos 6.5 Web APP - Tomcat6 Using Solr 4.10 Tika Jar * tika-core-1.5.jar * tika-parsers-1.5.jar * tika-xmp-1.5.jar * pdfbox-1.8.4.jar Reporter: Tadeu Alves Labels: bug, parser, solr, tika,text-extraction Fix For: 1.6 When I try to extract text from some pdf files with the tika app 1.5 null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@52cfcf01 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) 
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@52cfcf01 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219) ... 19 more Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 0 at java.lang.String.charAt(String.java:658)
[jira] [Commented] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14184073#comment-14184073 ] Uwe Schindler commented on TIKA-1387: - I think this is already committed and working. I think the issue was just not closed. In parent/pom.xml the plugin is enabled... So we can keep 1.7 as the fix version and resolve this issue. Or am I missing something? Add forbidden-apis checker to TIKA build Key: TIKA-1387 URL: https://issues.apache.org/jira/browse/TIKA-1387 Project: Tika Issue Type: Improvement Components: general Reporter: Uwe Schindler Assignee: Tyler Palsulich Fix For: 1.8 Attachments: TIKA-1387.palsulich.080614.patch, TIKA-1387.patch, TIKA-1387.patch, TIKA-1387.patch Lucene and many other projects already use the forbidden-apis checker to prevent use of some broken classes/signatures from the JDK. These are especially things using default character sets or default locales. The forbidden-apis checker can also be used to explicitly disallow specific methods, if they have security issues (e.g., creating XML parsers without disabling external entity support). The attached patch adds the forbidden-apis checker to the tika-parent pom file with default configuration. Running it fails with many errors in TIKA core already: {noformat} [INFO] --- forbiddenapis:1.6.1:check (default) @ tika-core --- [INFO] Scanning for classes to check... [INFO] Reading bundled API signatures: jdk-unsafe [INFO] Reading bundled API signatures: jdk-deprecated [INFO] Loading classes to check... [INFO] Scanning for API signatures and dependencies... 
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.language.LanguageProfilerBuilder (LanguageProfilerBuilder.java:407) [ERROR] Forbidden method invocation: java.lang.String#toUpperCase() [Uses default locale] [ERROR] in org.apache.tika.io.FilenameUtils (FilenameUtils.java:68) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:257) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:395) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:416) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:438) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:532) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:550) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:588) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:656) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:782) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:851) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:957) [ERROR] 
Forbidden method invocation: java.io.OutputStreamWriter#init(java.io.OutputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:1064) [ERROR] Forbidden method invocation: java.io.OutputStreamWriter#init(java.io.OutputStream) [Uses default charset] [ERROR] in org.apache.tika.sax.WriteOutContentHandler (WriteOutContentHandler.java:93) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.parser.external.ExternalParser (ExternalParser.java:234) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.parser.external.ExternalParser$3 (ExternalParser.java:294) [ERROR] Forbidden method invocation: java.util.Calendar#getInstance(java.util.Locale) [Uses default locale or time zone] [ERROR] in org.apache.tika.utils.DateUtils (DateUtils.java:83) [ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[])
[jira] [Commented] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095798#comment-14095798 ] Uwe Schindler commented on TIKA-1387: - I think, for messages written in the English language (like those written to logs), ENGLISH is more correct. But it does not really matter. About the charsets: I would define a constant in IOUtils {{public static final Charset UTF_8 = Charset.forName("UTF-8");}} and then pass this to all methods that accept it (like Readers, String,...). This is also faster than a synchronized String lookup on every conversion, like done by the standard default charset or String charset parameter. Java 7 has StandardCharsets.UTF_8 but we cannot use this at the moment. But it's defined like the one I propose for IOUtils. Add forbidden-apis checker to TIKA build Key: TIKA-1387 URL: https://issues.apache.org/jira/browse/TIKA-1387 Project: Tika Issue Type: Improvement Components: general Reporter: Uwe Schindler Assignee: Tyler Palsulich Fix For: 1.7 Attachments: TIKA-1387.palsulich.080614.patch, TIKA-1387.patch, TIKA-1387.patch, TIKA-1387.patch Lucene and many other projects already use the forbidden-apis checker to prevent use of some broken classes/signatures from the JDK. These are especially things using default character sets or default locales. The forbidden-apis checker can also be used to explicitly disallow specific methods, if they have security issues (e.g., creating XML parsers without disabling external entity support). The attached patch adds the forbidden-apis checker to the tika-parent pom file with default configuration. Running it fails with many errors in TIKA core already: {noformat} [INFO] --- forbiddenapis:1.6.1:check (default) @ tika-core --- [INFO] Scanning for classes to check... [INFO] Reading bundled API signatures: jdk-unsafe [INFO] Reading bundled API signatures: jdk-deprecated [INFO] Loading classes to check... [INFO] Scanning for API signatures and dependencies... 
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.language.LanguageProfilerBuilder (LanguageProfilerBuilder.java:407)
[ERROR] Forbidden method invocation: java.lang.String#toUpperCase() [Uses default locale]
[ERROR]   in org.apache.tika.io.FilenameUtils (FilenameUtils.java:68)
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:257)
[ERROR] Forbidden method invocation: java.lang.String#<init>(byte[]) [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:395)
[ERROR] Forbidden method invocation: java.lang.String#<init>(byte[]) [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:416)
[ERROR] Forbidden method invocation: java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:438)
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:532)
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:550)
[ERROR] Forbidden method invocation: java.lang.String#<init>(byte[]) [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:588)
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:656)
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:782)
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:851)
[ERROR] Forbidden method invocation: java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:957)
[ERROR] Forbidden method invocation: java.io.OutputStreamWriter#<init>(java.io.OutputStream) [Uses default charset]
[ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:1064)
[ERROR] Forbidden method invocation: java.io.OutputStreamWriter#<init>(java.io.OutputStream) [Uses default charset]
[ERROR]   in org.apache.tika.sax.WriteOutContentHandler (WriteOutContentHandler.java:93)
[ERROR] Forbidden method invocation: java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default charset]
[ERROR]   in org.apache.tika.parser.external.ExternalParser (ExternalParser.java:234)
[ERROR] Forbidden method invocation: java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default
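The charset constant Uwe proposes above can be sketched as follows (a minimal illustration, not Tika's actual IOUtils; class and method names here are hypothetical). The point is to resolve the Charset object once and pass it explicitly, instead of relying on the platform default or repeating a String-name lookup:

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class CharsetConstants {

    // The proposed constant: resolve the charset object once, instead of
    // a synchronized name lookup on every conversion. (Java 7+ code would
    // use java.nio.charset.StandardCharsets.UTF_8 instead.)
    public static final Charset UTF_8 = Charset.forName("UTF-8");

    // Pass the Charset explicitly so the platform default is never used.
    public static Reader utf8Reader(InputStream in) {
        return new InputStreamReader(in, UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = "héllo".getBytes(UTF_8);        // encode explicitly
        String roundTrip = new String(bytes, UTF_8);   // decode explicitly
        System.out.println(roundTrip.equals("héllo")); // prints "true"
    }
}
```

With an explicit Charset, encode/decode round-trips are identical on every platform, which is exactly what the forbidden-apis "Uses default charset" signatures are guarding against.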
[jira] [Commented] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095853#comment-14095853 ]

Uwe Schindler commented on TIKA-1387:
-
Nick: in ImageMetadataExtractor.java, the date format is static, so it does not help that a new instance is created. If you remove the static, it should be fine.
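The static-date-format problem can be illustrated like this (an illustrative class, not Tika's actual ImageMetadataExtractor): SimpleDateFormat is not thread-safe, and a static field means one formatter is shared by all threads even though each caller creates a new instance of the enclosing class. Dropping static gives every instance its own formatter:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class DateFormatSafety {

    // Unsafe variant (what a static field amounts to): one mutable
    // SimpleDateFormat shared across all threads.
    // private static final SimpleDateFormat FMT = ...;

    // Safe: a formatter per instance, with explicit locale and time zone.
    private final SimpleDateFormat fmt;

    public DateFormatSafety() {
        fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    }

    // synchronized in case one instance is still shared between threads.
    public synchronized String format(Date date) {
        return fmt.format(date);
    }

    public static void main(String[] args) {
        // Epoch zero rendered in UTC.
        System.out.println(new DateFormatSafety().format(new Date(0L))); // prints "1970-01-01T00:00:00"
    }
}
```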
[jira] [Created] (TIKA-1387) Add forbidden-apis checker to TIKA build
Uwe Schindler created TIKA-1387:
-
Summary: Add forbidden-apis checker to TIKA build
Key: TIKA-1387
URL: https://issues.apache.org/jira/browse/TIKA-1387
Project: Tika
Issue Type: Improvement
Components: general
Reporter: Uwe Schindler
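For reference, the plugin registration added to tika-parent would look roughly like this. This is a sketch reconstructed from the plugin name, version, and bundled signatures visible in the build log; the actual patch may differ:

```xml
<plugin>
  <groupId>de.thetaphi</groupId>
  <artifactId>forbiddenapis</artifactId>
  <version>1.6.1</version>
  <configuration>
    <bundledSignatures>
      <bundledSignature>jdk-unsafe</bundledSignature>
      <bundledSignature>jdk-deprecated</bundledSignature>
    </bundledSignatures>
  </configuration>
  <executions>
    <execution>
      <goals>
        <goal>check</goal>
        <goal>testCheck</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```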
[jira] [Created] (TIKA-1386) Add forbidden-apis checker to TIKA build
Uwe Schindler created TIKA-1386:
-
Summary: Add forbidden-apis checker to TIKA build
Key: TIKA-1386
URL: https://issues.apache.org/jira/browse/TIKA-1386
Project: Tika
Issue Type: Improvement
Components: general
Reporter: Uwe Schindler
[jira] [Closed] (TIKA-1386) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler closed TIKA-1386.
-
Resolution: Duplicate

JIRA hung and created the issue two times.
[jira] [Updated] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated TIKA-1387:
Attachment: TIKA-1387.patch

This patch refactors the tika-java7 module a bit so that the forbidden-apis checker also uses the correct (Java 7) signatures. This was done by redefining the parent-pom properties instead of duplicating the compiler and forbidden-apis plugins.
[jira] [Commented] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087489#comment-14087489 ]

Uwe Schindler commented on TIKA-1387:
-
One suggestion: the official names of the source/target properties are maven.compile*r*.source and maven.compile*r*.target. I would suggest changing to those. Once that is done, you can remove the explicit declarations in the plugin configuration, because both the maven-compiler-plugin and the forbiddenapis plugin read those properties.
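With the renamed properties, tika-parent would carry something like the following (the 1.6 value is illustrative, not taken from the patch). Both maven-compiler-plugin and forbiddenapis pick these properties up automatically, so no per-plugin <source>/<target> configuration is needed:

```xml
<properties>
  <!-- Standard Maven property names, read by maven-compiler-plugin
       and by the forbiddenapis plugin -->
  <maven.compiler.source>1.6</maven.compiler.source>
  <maven.compiler.target>1.6</maven.compiler.target>
</properties>
```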
[jira] [Updated] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1387: Attachment: TIKA-1387.patch Patch with renamed properties to conform to Maven standards. Add forbidden-apis checker to TIKA build Key: TIKA-1387 URL: https://issues.apache.org/jira/browse/TIKA-1387 Project: Tika Issue Type: Improvement Components: general Reporter: Uwe Schindler Attachments: TIKA-1387.patch, TIKA-1387.patch, TIKA-1387.patch Lucene and many other projects already use the forbidden-apis checker to prevent use of some broken classes/signatures from the JDK. These are especially thing using default character sets or default locales. The forbidden-api checker can also be used to explcitely disallow specific methods, if they have security issues (e.g., creating XML parsers without disabling external entity support). The attached patch adds the forbidden-api checker to the tika-parent pom file with default configuration. Running it fails with many errors in TIKA core already: {noformat} [INFO] --- forbiddenapis:1.6.1:check (default) @ tika-core --- [INFO] Scanning for classes to check... [INFO] Reading bundled API signatures: jdk-unsafe [INFO] Reading bundled API signatures: jdk-deprecated [INFO] Loading classes to check... [INFO] Scanning for API signatures and dependencies... 
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.language.LanguageProfilerBuilder (LanguageProfilerBuilder.java:407) [ERROR] Forbidden method invocation: java.lang.String#toUpperCase() [Uses default locale] [ERROR] in org.apache.tika.io.FilenameUtils (FilenameUtils.java:68) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:257) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:395) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:416) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:438) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:532) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:550) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:588) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:656) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:782) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:851) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:957) [ERROR] 
Forbidden method invocation: java.io.OutputStreamWriter#init(java.io.OutputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:1064) [ERROR] Forbidden method invocation: java.io.OutputStreamWriter#init(java.io.OutputStream) [Uses default charset] [ERROR] in org.apache.tika.sax.WriteOutContentHandler (WriteOutContentHandler.java:93) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.parser.external.ExternalParser (ExternalParser.java:234) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.parser.external.ExternalParser$3 (ExternalParser.java:294) [ERROR] Forbidden method invocation: java.util.Calendar#getInstance(java.util.Locale) [Uses default locale or time zone] [ERROR] in org.apache.tika.utils.DateUtils (DateUtils.java:83) [ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale] [ERROR] in org.apache.tika.utils.DateUtils (DateUtils.java:91) [ERROR] Forbidden method invocation: java.lang.String#toLowerCase() [Uses default locale] [ERROR] in org.apache.tika.detect.MagicDetector (MagicDetector.java:98) [ERROR]
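The fixes for the hits above all follow one pattern: pass an explicit charset or locale instead of relying on the platform default. A minimal sketch of the pattern (the helper class and method names are hypothetical and not part of the attached patch):

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper illustrating the forbidden-apis fixes: every call site
// passes an explicit charset/locale instead of the platform default.
public class SafeApis {

    // Replaces String#getBytes() [uses default charset].
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Replaces String#toUpperCase() [uses default locale]; Locale.ROOT is
    // the right choice for protocol strings and identifiers, not user text.
    static String upperAscii(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    // Replaces String#format(String, Object...) [uses default locale].
    static String formatIso(String pattern, Object... args) {
        return String.format(Locale.ROOT, pattern, args);
    }

    public static void main(String[] args) {
        System.out.println(upperAscii("content-type"));   // CONTENT-TYPE
        System.out.println(formatIso("%02d:%02d", 7, 5)); // 07:05
    }
}
```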
[jira] [Commented] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088088#comment-14088088 ] Uwe Schindler commented on TIKA-1387: - Hi, I left a comment in the review; I was out for dinner. I would fix the issues in a different way at some places, especially String#toLowerCase(Locale.getDefault()), which has crazy effects in some languages (in Turkish, not even ASCII lower-cases as expected). Add forbidden-apis checker to TIKA build Key: TIKA-1387 URL: https://issues.apache.org/jira/browse/TIKA-1387 Project: Tika Issue Type: Improvement Components: general Reporter: Uwe Schindler Assignee: Tyler Palsulich Fix For: 1.7 Attachments: TIKA-1387.palsulich.080614.patch, TIKA-1387.patch, TIKA-1387.patch, TIKA-1387.patch
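The Turkish effect mentioned above is easy to reproduce with a standalone sketch: under the tr-TR locale, lower-casing ASCII "I" yields dotless i (U+0131), so even pure-ASCII strings do not lower-case as expected unless an explicit locale such as Locale.ROOT is used.

```java
import java.util.Locale;

// Demo of the Turkish-locale pitfall: "I".toLowerCase() under tr-TR produces
// dotless i (U+0131) instead of "i", breaking ASCII-only comparisons.
public class TurkishCaseDemo {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "TR");
        String withTurkishRules = "TITLE".toLowerCase(turkish); // "t\u0131tle"
        String withRoot = "TITLE".toLowerCase(Locale.ROOT);     // "title"
        System.out.println(withRoot.equals(withTurkishRules));  // false
    }
}
```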
[jira] [Reopened] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reopened TIKA-1387: - I disagree with some fixes, because they just work around the forbidden-apis checks by still using system defaults. Add forbidden-apis checker to TIKA build Key: TIKA-1387 URL: https://issues.apache.org/jira/browse/TIKA-1387 Project: Tika Issue Type: Improvement Components: general Reporter: Uwe Schindler Assignee: Tyler Palsulich Fix For: 1.7 Attachments: TIKA-1387.palsulich.080614.patch, TIKA-1387.patch, TIKA-1387.patch, TIKA-1387.patch
[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918634#comment-13918634 ] Uwe Schindler commented on TIKA-1252: - This could be a problem in Solr's DataImportHandler. I am not 100% sure if this one supports multiple values per key; maybe it is using a Map... In any case, if this is caused by Solr, I will move the issue over to SOLR. Tika is not indexing all authors of a PDF - Key: TIKA-1252 URL: https://issues.apache.org/jira/browse/TIKA-1252 Project: Tika Issue Type: Bug Components: metadata, parser Affects Versions: 1.4 Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, Bitnami Stack) Reporter: Alexandre Madurell When submitting a PDF with this information in its XMP metadata: ... <dc:creator> <rdf:Bag> <rdf:li>Author 1</rdf:li> <rdf:li>Author 2</rdf:li> </rdf:Bag> </dc:creator> ... Only the first one appears in the collection: ... author:[Author 1], author_s:Author 1, ... In spite of having set the field to multiValued in the Solr schema: <field name="author" type="text_general" indexed="true" stored="true" multiValued="true"/> Let me know if there's any further specific information I could provide. Thanks in advance! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918643#comment-13918643 ] Uwe Schindler commented on TIKA-1252: - I did a quick check in [https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java] Solr does not seem to remove duplicate values (see {{addMetadata()}} and {{addField(String fname, String fval, String[] vals)}}). Furthermore, if the field is *not* multiValued, the data is concatenated with whitespace and put into *one* field (see line 226 ff.). So this looks like a configuration problem or really a bug in TIKA. Tika is not indexing all authors of a PDF - Key: TIKA-1252 URL: https://issues.apache.org/jira/browse/TIKA-1252 Project: Tika Issue Type: Bug Components: metadata, parser Affects Versions: 1.4 Reporter: Alexandre Madurell
[jira] [Comment Edited] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918643#comment-13918643 ] Uwe Schindler edited comment on TIKA-1252 at 3/3/14 10:17 PM: -- I did a quick check in [https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java] Solr does not seem to remove duplicate keys (see {{addMetadata()}} and {{addField(String fname, String fval, String[] vals)}}). Furthermore, if the field is *not* multiValued, the data is concatenated with whitespace and put into *one* field (see line 226 ff.). So this looks like a configuration problem or really a bug in TIKA. was (Author: thetaphi): I did a quick check in [https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java] Solr does not seem to remove duplicate values (see {{addMetadata()}} and {{addField(String fname, String fval, String[] vals)}}). Furthermore, if the field is *not* multiValued, the data is concatenated with whitespace and put into *one* field (see line 226 ff.). So this looks like a configuration problem or really a bug in TIKA. Tika is not indexing all authors of a PDF - Key: TIKA-1252 URL: https://issues.apache.org/jira/browse/TIKA-1252 Project: Tika Issue Type: Bug Components: metadata, parser Affects Versions: 1.4 Reporter: Alexandre Madurell
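The single-field concatenation described in the comments above can be sketched generically (class and method names here are hypothetical illustrations; the real logic lives in Solr's SolrContentHandler around line 226 ff.): for a field that is not multiValued, all extracted metadata values are joined with whitespace into one field value rather than deduplicated or dropped.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the single-valued-field behavior: multiple metadata
// values are concatenated with a single space into one field value.
public class SingleValuedJoinDemo {

    static String joinForSingleValuedField(List<String> values) {
        return String.join(" ", values);
    }

    public static void main(String[] args) {
        List<String> authors = Arrays.asList("Author 1", "Author 2");
        System.out.println(joinForSingleValuedField(authors)); // Author 1 Author 2
    }
}
```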
[jira] [Created] (TIKA-1211) OpenDocument (ODF) parser produces multipe startDocument() events
Uwe Schindler created TIKA-1211: --- Summary: OpenDocument (ODF) parser produces multipe startDocument() events Key: TIKA-1211 URL: https://issues.apache.org/jira/browse/TIKA-1211 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Uwe Schindler Related to SOLR-4809: Solr receives multiple startDocument events when parsing OpenDocumentFiles. The parser already prevents multiple endDocuments, but not multiple startDocuments. The bug was introduced when we added parsing content.xml and meta.xml (TIKA-736, but both feed elements to the XHTML output, so we get multiple start/endDocuments). -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (TIKA-1211) OpenDocument (ODF) parser produces multiple startDocument() events
[ https://issues.apache.org/jira/browse/TIKA-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-1211: Summary: OpenDocument (ODF) parser produces multiple startDocument() events (was: OpenDocument (ODF) parser produces multipe startDocument() events) OpenDocument (ODF) parser produces multiple startDocument() events -- Key: TIKA-1211 URL: https://issues.apache.org/jira/browse/TIKA-1211 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Uwe Schindler
[jira] [Commented] (TIKA-1211) OpenDocument (ODF) parser produces multiple startDocument() events
[ https://issues.apache.org/jira/browse/TIKA-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13850416#comment-13850416 ] Uwe Schindler commented on TIKA-1211: - There are multiple ways to fix this: - Make XHTMLContentHandler prevent multiple startDocument() events. I think that's the easiest and most correct fix; XHTMLContentHandler already has some magic in there. - Add an additional content handler that removes subsequent startDocuments (this is the same as above, just in a separate handler). OpenDocument (ODF) parser produces multiple startDocument() events -- Key: TIKA-1211 URL: https://issues.apache.org/jira/browse/TIKA-1211 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Uwe Schindler
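The second option above (a separate handler that swallows subsequent startDocument() events) could look roughly like this sketch built on the JDK's bundled SAX classes; the class name is hypothetical and this is not the actual Tika fix:

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical decorator: forwards only the first startDocument() to the
// wrapped ContentHandler and silently drops any repeats.
public class SingleStartDocumentFilter extends XMLFilterImpl {
    private boolean documentStarted = false;

    public SingleStartDocumentFilter(ContentHandler target) {
        setContentHandler(target);
    }

    @Override
    public void startDocument() throws SAXException {
        if (!documentStarted) {
            documentStarted = true;
            super.startDocument(); // forwarded exactly once
        }
    }

    public static void main(String[] args) throws SAXException {
        final int[] starts = {0};
        ContentHandler counter = new DefaultHandler() {
            @Override
            public void startDocument() { starts[0]++; }
        };
        SingleStartDocumentFilter filter = new SingleStartDocumentFilter(counter);
        filter.startDocument();
        filter.startDocument(); // duplicate, swallowed
        System.out.println(starts[0]); // 1
    }
}
```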
[jira] [Commented] (TIKA-1181) RTFParser not keeping HTML font colors and underscore tags.
[ https://issues.apache.org/jira/browse/TIKA-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13788171#comment-13788171 ] Uwe Schindler commented on TIKA-1181: - Other parsers like OpenOffice do not preserve colors either. RTFParser not keeping HTML font colors and underscore tags. --- Key: TIKA-1181 URL: https://issues.apache.org/jira/browse/TIKA-1181 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: Windows server 2008 Reporter: Leo Labels: RTFParser Hi, I'm having problems with this code. It does not put the font colors and underscore <u></u> tags in the HTML from the RTF string. Is there anything I can do to put them there? Code:
InputStream in = new ByteArrayInputStream(rtfString.getBytes("UTF-8"));
org.apache.tika.parser.rtf.RTFParser parser = new org.apache.tika.parser.rtf.RTFParser();
Metadata metadata = new Metadata();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));
parser.parse(in, handler, metadata, new ParseContext());
String xhtml = sw.toString();
xhtml = xhtml.replaceAll("\r\n", "<br/>\r\n");
Thanks for looking at it. Leo -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for br tags when parsing HTML
[ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734769#comment-13734769 ] Uwe Schindler commented on TIKA-1134: - Hoss: I agree to fix this in the documentation. On SOLR-4679 I explained in more detail *why TIKA is doing this*: {quote} Let me recapitulate TIKA's problems: - TIKA decided to use XHTML as its output format to report the parsed documents to the consumer. This is nice, because it allows preserving some of the formatting (like bold fonts, paragraphs, ...) originating from the original document. Of course most of this formatting is lost, but you can still detect things like emphasized text. By choosing XHTML as the output format, TIKA must of course use XHTML formatting for new lines and similar. So whenever a line break is needed, the TIKA parser emits a <br/> tag or places the paragraph (in a PDF) inside a <p> element. As we all know, HTML ignores formatting like newlines, tabs, ... (all are treated as one single whitespace, like the regex replace {{s/\s+/ /}}). - On the other hand, TIKA wants to make it simple for people to extract the *plain text* contents. With the XHTML-only approach this would be hard for the consumer, because to add the correct newlines, the consumer has to fully understand XHTML, detect block elements, and replace them by \n. To support both usages of TIKA, the idea was to embed this information, which is unimportant to HTML (as HTML ignores whitespace completely), as ignorableWhitespace, as a convenience for the user. A fully compliant XHTML consumer would not parse the ignorable stuff; as it understands HTML, it would detect a <p> element as a block element and format the output. Solr unfortunately has a strange approach: It is mainly interested in the text-only contents, so ideally when consuming the HTML it could use {{WriteOutContentHandler(stringBuilder, new BodyContentHandler(parserContentHandler))}}.
In that case TIKA would do the right thing automatically: It would extract only text from the <body> element and would use the convenience whitespace to format the text in an ASCII-art-like way (using tabs, newlines, ...) :-) Solr has a hybrid approach: It collects everything into a content tag (which is similar to the above approach), but the bug is that, in contrast to TIKA's official WriteOutContentHandler, it does not use the ignorable whitespace inserted for convenience. In addition, TIKA also has a stack where it allows processing parts of the documents (like the <title> element or all <em> elements). In that case it has several StringBuilders in parallel that are populated with the contents. The problems exist here too, but cannot be solved by using ignorable whitespace: e.g., if one indexes only all <em> elements (which are inline HTML elements, not block elements), there is no whitespace, so all <em> elements would be glued together in the em field of your index... I just mention this; in my opinion the SolrContentHandler needs more work to correctly understand HTML and not just collect element names in a map! Now to your complaint: You proposed to report the newlines as real {{characters()}} events - but this is not the right thing to do here. As I said, HTML does not know these characters; they are ignored. The formatting is done by the element names (like p, div, table). So the helper whitespace for text-only consumers should be inserted as ignorableWhitespace only; if we added it to the real character data, we would report things that every HTML parser (like NekoHTML) would never report to the consumer. NekoHTML would also report this useless extra whitespace as ignorable. The convenience here is that TIKA's XHTMLContentHandler, used by all parsers, is configured to help the text-only user but not hurt the HTML-only user. This differentiation is done by reporting the HTML element names (p, div, table, th, td, tr, abbr, em, strong, ...)
but also reporting the ASCII-art text-only content like TABs inside tables, newlines after block elements, ... This is always done as ignorableWhitespace (for convenience); a real HTML parser must ignore it - and it's correct to do this. {quote} I think we should document this in the javadocs or the howto page, so implementors of ContentHandlers know what to do! ContentHandler gets ignorable whitespace for br tags when parsing HTML Key: TIKA-1134 URL: https://issues.apache.org/jira/browse/TIKA-1134 Project: Tika Issue Type: Bug Components: parser Reporter: Hoss Man Attachments: TIKA-1134.patch
[jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for br tags when parsing HTML
[ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733344#comment-13733344 ] Uwe Schindler commented on TIKA-1134: - Hi Hoss, the rule in TIKA is: - TIKA inserts ignorableWhitespace to support plain-text extraction on block elements and <br/> tags (which are also somehow empty block elements) - see TIKA-171. Nothing else will insert ignorableWhitespace into the content handler. This means consumers that are only interested in the *plain text* contents of parsed files should ignore all HTML syntax elements and just treat ignorableWhitespace as significant - this is what TextOnlyContentHandler does to extract text. This was decided in TIKA-171 a long time ago. If you are interested in *structured* HTML output, use the XHTML elements and ignore the whitespace. ContentHandler gets ignorable whitespace for br tags when parsing HTML Key: TIKA-1134 URL: https://issues.apache.org/jira/browse/TIKA-1134 Project: Tika Issue Type: Bug Components: parser Reporter: Hoss Man Attachments: TIKA-1134.patch I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding something here, but it appears that the way Tika parses HTML to produce XHTML SAX events is misinterpreting <br> tags as equivalent to ignorable whitespace containing a newline. This means that clients who ask Tika to parse files, and specify their own ContentHandler to capture the character data, can get sequences of run-on text w/o knowing that the <br> tag was present -- _unless_ they explicitly handle ignorableWhitespace and treat it as real whitespace -- but this creates a catch-22 if you really do want to ignore the ignorable whitespace in the HTML markup. The crux of the problem seems to be: * instead of generating a startElement event for <br>, the HtmlParser treats it as a xhtml.newline().
* xhtml.newline() generates an ignorableWhitespace SAX event instead of a characters SAX event ...either one of these by themselves might be fine, but in combination they don't really make any sense. If for example an actual newline exists in the html, it comes across as part of a characters SAX event, not as ignorable whitespace. Changing the newline() function to delegate to characters(...) seems to solve the problem for <br> tags in HTML, but breaks several tests -- probably because the newline() function is also used to intentionally add (synthetic) ignorableWhitespace events after elements. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
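The text-only consumption rule described above can be sketched with a plain SAX handler that, unlike Solr's, treats ignorableWhitespace as significant. The class name is hypothetical; Tika's real implementations of this idea are WriteOutContentHandler and BodyContentHandler.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a plain-text consumer: it keeps the synthetic
// ignorableWhitespace (newlines/tabs) that Tika emits for block elements
// and <br/> tags, instead of dropping it like an HTML-aware consumer would.
public class TextOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Significant for plain-text extraction: this is where Tika's
        // convenience newlines and tabs arrive.
        text.append(ch, start, length);
    }

    public String getText() { return text.toString(); }

    public static void main(String[] args) {
        TextOnlyHandler h = new TextOnlyHandler();
        h.characters("line one".toCharArray(), 0, 8);
        h.ignorableWhitespace(new char[]{'\n'}, 0, 1); // what Tika emits for <br/>
        h.characters("line two".toCharArray(), 0, 8);
        System.out.println(h.getText()); // prints "line one" and "line two" on separate lines
    }
}
```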
[jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for br tags when parsing HTML
[ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733348#comment-13733348 ] Uwe Schindler commented on TIKA-1134: - I think this issue is Won't Fix. The issues described by Hoss are caused by user error :-) So maybe keep this open to update the javadocs inside all those wrapper ContentHandlers like BodyContentHandler to explicitly state that they extract plain text and add extra whitespace to support this. ContentHandler gets ignorable whitespace for br tags when parsing HTML Key: TIKA-1134 URL: https://issues.apache.org/jira/browse/TIKA-1134 Project: Tika Issue Type: Bug Components: parser Reporter: Hoss Man Attachments: TIKA-1134.patch
[jira] [Commented] (TIKA-1145) classloaders issue loading resources when extending Tika
[ https://issues.apache.org/jira/browse/TIKA-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13699793#comment-13699793 ] Uwe Schindler commented on TIKA-1145: - I think the main problem is ServiceLoader's definition. It uses the context class loader to load SPIs, which is in my opinion a bug in the spec. In Lucene we had the same problems with our own ServiceLoader impl, which uses the abstract class's/interface's classloader to load its own implementations. See LUCENE-4713 for more info, where Lucene uses SPI to load its codecs and analyzers. classloaders issue loading resources when extending Tika Key: TIKA-1145 URL: https://issues.apache.org/jira/browse/TIKA-1145 Project: Tika Issue Type: Bug Components: config, mime Affects Versions: 1.3 Environment: Tika as part of standard Solr distribution Reporter: Maciej Lizewski I noticed that ServiceLoader is using a different classloader when loading 'services' like Parsers, etc. (java.net.FactoryURLClassLoader) than MimeTypesFactory (org.eclipse.jetty.webapp.WebAppClassLoader) when loading mime type definitions. As a result, it works completely differently: When a jar with a custom parser and custom-mimetypes.xml is added to solr.war - both resources are located and loaded (META-INF\services\org.apache.tika.parser.Parser and org\apache\tika\mime\custom-mimetypes.xml) and everything works fine. When the jar with the custom parser is in the Solr core lib and configured in solrconfig.xml - only META-INF\services\org.apache.tika.parser.Parser is loaded, but custom-mimetypes.xml is ignored.
[jira] [Commented] (TIKA-1145) classloaders issue loading resources when extending Tika
[ https://issues.apache.org/jira/browse/TIKA-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699883#comment-13699883 ] Uwe Schindler commented on TIKA-1145: - OK, I misunderstood the original problem. If you pass the correct config's class loader everywhere TIKA uses ServiceLoader or otherwise looks up resources, it should be fine.
> MimeTypesFactory ignores the custom classLoader provided in TikaConfig and always uses only the context-provided one: ClassLoader cl = MimeTypesReader.class.getClassLoader();
[jira] [Commented] (TIKA-1145) classloaders issue loading resources when extending Tika
[ https://issues.apache.org/jira/browse/TIKA-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699888#comment-13699888 ] Uwe Schindler commented on TIKA-1145: - It is still strange that you see this behaviour: if the JAR files of DIH, TIKA's JAR files, and your custom parsers are all in the SolrCore's lib folder, they all share the same classloader (the SolrCore's ResourceLoader's classloader). Problems would only exist if the TIKA and DIH classes are in the WAR file but the custom parser is in the lib or conf dir of the Solr core. In that case MimeTypesFactory only loads classes from its own class loader (which is the webapp's), not through the Solr ResourceLoader. In any case, MimeTypesFactory should use the configured classloader.
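The fix being asked for boils down to a lookup pattern like the following minimal sketch (ResourceLookup and its method are illustrative names, not Tika's actual API): prefer the explicitly configured classloader, and fall back to the defining class's own loader only when none was supplied.

```java
import java.io.InputStream;

// Minimal sketch (illustrative names, not Tika's actual API) of the fix:
// resource lookup should prefer an explicitly configured classloader and
// only fall back to the defining class's own loader, instead of
// unconditionally using MimeTypesReader.class.getClassLoader() as the
// quoted line does.
public class ResourceLookup {
    public static InputStream open(ClassLoader configured, String name) {
        ClassLoader cl = (configured != null)
                ? configured
                : ResourceLookup.class.getClassLoader();  // fallback only
        return cl.getResourceAsStream(name);
    }
}
```

With this pattern, a container like Solr could hand its ResourceLoader's classloader down through the config, and custom-mimetypes.xml in a core's lib dir would be found the same way the parser service file is.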