[jira] [Commented] (TIKA-2245) Standardise logging
[ https://issues.apache.org/jira/browse/TIKA-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698065#comment-17698065 ]

Konstantin Gribov commented on TIKA-2245:
-----------------------------------------

[~lfcnassif], yeah, {{commons-logging}} must be excluded when using {{jcl-over-slf4j}}. When using the Log4j2 bridge {{org.apache.logging.log4j:log4j-jcl}} it's the opposite ({{commons-logging}} must be present on the classpath), but there is still no need for an explicit dependency; it is brought in transitively.

{{jackcess}} should stay on {{commons-logging}} (at least until a release with breaking changes), but on our side we should have the exclusion and {{jcl-over-slf4j}}. I'm not sure it should be at the {{tika-parser-*-module}} level though. I'd prefer it in {{tika-parsers-standard-package}} only. That way, if an advanced downstream user chooses one of the fine-grained {{tika-parser-*-module}}s, they add either {{jcl-over-slf4j}} or {{log4j-jcl}} to their classpath. For more mainstream usage, {{tika-parsers-standard-package}} brings a convenient bridge without much hassle.

I updated the [Logging wiki page|https://cwiki.apache.org/confluence/display/TIKA/Logging] after 2.6.0 to more or less represent the current state of affairs. Maybe I should migrate it to {{src/site}} in the future. The Confluence editor is such a pain in the arse when adding or editing code blocks if you have more than one on a wiki page.

> Standardise logging
> -------------------
>
> Key: TIKA-2245
> URL: https://issues.apache.org/jira/browse/TIKA-2245
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.14, 1.15
> Reporter: Matthew Caruana Galizia
> Assignee: Konstantin Gribov
> Priority: Major
> Labels: logging
> Fix For: 1.15
>
> Tika parsers sometimes use Log4j's Logger, sometimes the JUL (java.util.logging) Logger and sometimes SLF4J.
> It would be better to standardise on a single facade, for the sake of not having to configure multiple loggers.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-3934) Reogranize POMs parent chain to avoid leaking dependency management downstream
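[ https://issues.apache.org/jira/browse/TIKA-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636242#comment-17636242 ]

Konstantin Gribov edited comment on TIKA-3934 at 11/19/22 10:31 PM:
--------------------------------------------------------------------

It seems that it doesn't if the dependency isn't used in the Tika artifact in any way (including test dependencies). If I import {{org.apache.tika:tika-bom}} and add {{org.apache.tika:tika-core}} and {{io.netty:netty-buffer}} without versions, both the Maven and the Gradle build fail.

On the other hand, the {{log4j-core}} version (and the version constraint in the Gradle case) leaks from {{tika-parent}} via {{tika-bom}}. In the Maven case it does so inconsistently.

||Type||Use BOM||tika-core||log4j-core||Result||
|Maven|yes|-|-|log4j-api 2.19.0, log4j-core 2.19.0|
|Maven|yes|-|2.18.0|log4j-api 2.19.0, log4j-core 2.18.0|
|Maven|no|2.6.0|2.18.0|log4j-api 2.18.0, log4j-core 2.18.0|
|Gradle|yes|-|-|log4j-api 2.19.0, log4j-core 2.19.0|
|Gradle|yes|-|2.18.0|log4j-api 2.19.0, log4j-core 2.19.0|
|Gradle|no|2.6.0|2.18.0|log4j-api 2.18.0, log4j-core 2.18.0|

Test Maven project (run {{mvn package}} to see actual dependencies in the output):

{code:xml|title=pom.xml}
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.example</groupId>
  <artifactId>bom-test</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <maven.compiler.source>17</maven.compiler.source>
    <maven.compiler.target>17</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-bom</artifactId>
        <version>2.6.0</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>

  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
    </dependency>
    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-core</artifactId>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <version>3.3.0</version>
        <executions>
          <execution>
            <id>test</id>
            <phase>package</phase>
            <goals>
              <goal>copy-dependencies</goal>
            </goals>
            <configuration>
              <outputDirectory>${project.build.directory}/deps</outputDirectory>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
{code}

Gradle test project (run {{gradle dependencyInsight --dependency log4j}} or {{gradle dependencies --configuration rC}}):

{code:groovy|title=settings.gradle.kts}
dependencyResolutionManagement {
    repositories.mavenCentral()
}
{code}

{code:groovy|title=build.gradle.kts}
plugins {
    id("java-library")
}

dependencies {
    api(platform("org.apache.tika:tika-bom:2.6.0"))
    api("org.apache.tika:tika-core")
    implementation("org.apache.logging.log4j:log4j-core:2.18.0")
}
{code}

was (Author: grossws):

It seems that it doesn't, if I have import for {{org.apache.tika:tika-bom}} and add {{org.apache.tika:tika-core}} and {{io.netty:netty-buffer}} without versions both Maven and Gradle build will fail.

On the other hand {{log4j-core}} version (and version constraint in Gradle case) leaks from {{tika-parent}} via {{tika-bom}}.

||Type||Use BOM||tika-core||log4j-core||Result||
|Maven|yes|-|-|log4j-api 2.19.0, log4j-core 2.19.0|
|Maven|yes|-|2.18.0|log4j-api 2.19.0, log4j-core 2.18.0|
|Maven|no|2.6.0.|2.18.0|log4j-api 2.18.0, log4j-core 2.18.0|
|Gradle|yes|-|-|log4j-api 2.19.0, log4j-core 2.19.0|
|Gradle|yes|-|2.18.0|log4j-api 2.19.0, log4j-core 2.19.0|
|Gradle|no|2.6.0|2.18.0|log4j-api 2.18.0, log4j-core 2.18.0|

Test Maven project (run {{mvn package}} to see actual dependencies in the output):

{code:xml|title=pom.xml}
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.example</groupId>
  <artifactId>bom-test</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <maven.compiler.source>17</maven.compiler.source>
    <maven.compiler.target>17</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-bom</artifactId>
        <version>2.6.0</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>

  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
    </dependency>
    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-core</artifactId>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <version>3.3.0</version>
        <executions>
          <execution>
            <id>test</id>
            <phase>package</phase>
            <goals>
              <goal>copy-dependencies</goal>
            </goals>
            <configuration>
              <outputDirectory>${project.build.directory}/deps</outputDirectory>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
{code}

Gradle test project (run {{gradle dependencyInsight --dependency log4j}} or {{gradle dependencies --configuration rC}}):

{code:kotlin|title=settings.gradle.kts}
dependencyResolutionManagement {
    repositories.mavenCentral()
}
{code}

{code:kotlin|title=build.gradle.kts}
plugins {
    `java-library`
}

dependencies {
    api(platform("org.apache.tika:tika-bom:2.6.0"))
    api("org.apache.tika:tika-core")
    implementation("org.apache.logging.log4j:log4j-core:2.18.0")
}
{code}

> Reogranize POMs parent chain to avoid leaking dependency management downstream
> ------------------------------------------------------------------------------
>
> Key: TIKA-3934
> URL: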
[jira] [Commented] (TIKA-3934) Reogranize POMs parent chain to avoid leaking dependency management downstream
[ https://issues.apache.org/jira/browse/TIKA-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636242#comment-17636242 ]

Konstantin Gribov commented on TIKA-3934:
-----------------------------------------

It seems that it doesn't: if I import {{org.apache.tika:tika-bom}} and add {{org.apache.tika:tika-core}} and {{io.netty:netty-buffer}} without versions, both the Maven and the Gradle build fail.

On the other hand, the {{log4j-core}} version (and the version constraint in the Gradle case) leaks from {{tika-parent}} via {{tika-bom}}.

||Type||Use BOM||tika-core||log4j-core||Result||
|Maven|yes|-|-|log4j-api 2.19.0, log4j-core 2.19.0|
|Maven|yes|-|2.18.0|log4j-api 2.19.0, log4j-core 2.18.0|
|Maven|no|2.6.0|2.18.0|log4j-api 2.18.0, log4j-core 2.18.0|
|Gradle|yes|-|-|log4j-api 2.19.0, log4j-core 2.19.0|
|Gradle|yes|-|2.18.0|log4j-api 2.19.0, log4j-core 2.19.0|
|Gradle|no|2.6.0|2.18.0|log4j-api 2.18.0, log4j-core 2.18.0|

Test Maven project (run {{mvn package}} to see actual dependencies in the output):

{code:xml|title=pom.xml}
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.example</groupId>
  <artifactId>bom-test</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <maven.compiler.source>17</maven.compiler.source>
    <maven.compiler.target>17</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-bom</artifactId>
        <version>2.6.0</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>

  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
    </dependency>
    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-core</artifactId>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <version>3.3.0</version>
        <executions>
          <execution>
            <id>test</id>
            <phase>package</phase>
            <goals>
              <goal>copy-dependencies</goal>
            </goals>
            <configuration>
              <outputDirectory>${project.build.directory}/deps</outputDirectory>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
{code}

Gradle test project (run {{gradle dependencyInsight --dependency log4j}} or {{gradle dependencies --configuration rC}}):

{code:kotlin|title=settings.gradle.kts}
dependencyResolutionManagement {
    repositories.mavenCentral()
}
{code}

{code:kotlin|title=build.gradle.kts}
plugins {
    `java-library`
}

dependencies {
    api(platform("org.apache.tika:tika-bom:2.6.0"))
    api("org.apache.tika:tika-core")
    implementation("org.apache.logging.log4j:log4j-core:2.18.0")
}
{code}

> Reogranize POMs parent chain to avoid leaking dependency management downstream
> ------------------------------------------------------------------------------
>
> Key: TIKA-3934
> URL: https://issues.apache.org/jira/browse/TIKA-3934
> Project: Tika
> Issue Type: Improvement
> Components: depedency
> Affects Versions: 2.6.0
> Reporter: Konstantin Gribov
> Assignee: Konstantin Gribov
> Priority: Major
> Fix For: 2.6.1, 2.7.0
>
> Tika's BOM (Bill of Materials) artifact has {{tika-parent}} as a parent POM and thus forces a lot of dependency versions on downstream users.
> For example, if one uses only the PDF module there's no reason to force Netty/Jetty/CXF/whatever versions.
> I propose the following:
> * make the {{tika}} reactor depend on {{tika-parent}} and all other {{tika-*}} modules on the reactor
> * move all our dependency management and build-related configuration to the reactor ({{tika}} root project)
> I started this work last week and will publish a first PR for review soon.
> Moving parts from {{tika-parent}} to {{tika}} may take some time, so small steps without build disruption are a must.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
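The {{log4j-core}} column of the table in the comment above can be reproduced with a toy model of the two resolution strategies. This is only a sketch: the class and method names are hypothetical and it does not call Maven or Gradle. It assumes the commonly documented behaviour that in Maven a version declared on a direct dependency overrides an imported {{dependencyManagement}} entry, while in Gradle a {{platform(...)}} constraint and a declared version both take part in conflict resolution and the highest version wins. Transitive effects, such as {{log4j-api}} following {{log4j-core}} when no BOM is used, are not modelled.

```java
import java.util.Arrays;

// Toy model only: these names are illustrative and are not Maven or Gradle APIs.
class ResolutionSketch {

    /** Compares dotted numeric versions such as "2.18.0" and "2.19.0". */
    static int compareVersions(String a, String b) {
        int[] x = Arrays.stream(a.split("\\.")).mapToInt(Integer::parseInt).toArray();
        int[] y = Arrays.stream(b.split("\\.")).mapToInt(Integer::parseInt).toArray();
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? x[i] : 0;
            int yi = i < y.length ? y[i] : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** Maven (assumed rule): an explicitly declared version beats the managed one. */
    static String maven(String managed, String declared) {
        return declared != null ? declared : managed;
    }

    /** Gradle (assumed rule): managed and declared versions compete, highest wins. */
    static String gradle(String managed, String declared) {
        if (declared == null) {
            return managed;
        }
        if (managed == null) {
            return declared;
        }
        return compareVersions(declared, managed) >= 0 ? declared : managed;
    }

    public static void main(String[] args) {
        String bom = "2.19.0"; // log4j-core version managed via tika-bom 2.6.0, per the table
        System.out.println("Maven,  BOM only:       " + maven(bom, null));
        System.out.println("Maven,  BOM + 2.18.0:   " + maven(bom, "2.18.0"));
        System.out.println("Gradle, BOM only:       " + gradle(bom, null));
        System.out.println("Gradle, BOM + 2.18.0:   " + gradle(bom, "2.18.0"));
        System.out.println("Gradle, no BOM, 2.18.0: " + gradle(null, "2.18.0"));
    }
}
```

Under these assumptions the model yields 2.18.0 for Maven and 2.19.0 for Gradle in the "BOM plus explicit 2.18.0" rows, matching the {{log4j-core}} results reported in the table.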
[jira] [Commented] (TIKA-3934) Reogranize POMs parent chain to avoid leaking dependency management downstream
[ https://issues.apache.org/jira/browse/TIKA-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636239#comment-17636239 ]

Konstantin Gribov commented on TIKA-3934:
-----------------------------------------

I need to recheck whether Maven inherits the parent's dependencyManagement via an imported BOM. Maybe this issue is invalid.

> Reogranize POMs parent chain to avoid leaking dependency management downstream
> ------------------------------------------------------------------------------
>
> Key: TIKA-3934
> URL: https://issues.apache.org/jira/browse/TIKA-3934
> Project: Tika
> Issue Type: Improvement
> Components: depedency
> Affects Versions: 2.6.0
> Reporter: Konstantin Gribov
> Assignee: Konstantin Gribov
> Priority: Major
> Fix For: 2.6.1, 2.7.0
>
> Tika's BOM (Bill of Materials) artifact has {{tika-parent}} as a parent POM and thus forces a lot of dependency versions on downstream users.
> For example, if one uses only the PDF module there's no reason to force Netty/Jetty/CXF/whatever versions.
> I propose the following:
> * make the {{tika}} reactor depend on {{tika-parent}} and all other {{tika-*}} modules on the reactor
> * move all our dependency management and build-related configuration to the reactor ({{tika}} root project)
> I started this work last week and will publish a first PR for review soon.
> Moving parts from {{tika-parent}} to {{tika}} may take some time, so small steps without build disruption are a must.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3735) Require Java 11 for 2.x at some point
[ https://issues.apache.org/jira/browse/TIKA-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636238#comment-17636238 ]

Konstantin Gribov commented on TIKA-3735:
-----------------------------------------

Another thing that comes to mind is that we could require different JDK versions for downstream consumers of Tika and for building Tika itself (including tests). Maybe even for some modules that are for internal use, if we can consider any module internal.

> Require Java 11 for 2.x at some point
> -------------------------------------
>
> Key: TIKA-3735
> URL: https://issues.apache.org/jira/browse/TIKA-3735
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> This follows on from discussion we had on the user/dev list for when we want to require Java 11. I think the consensus was: wait until we have to.
> The following libraries require > Java 8 at the moment. I don't think updating any of these is critical, but I do want to document where we're stuck.
> We can modify/edit this list as necessary:
> * Apache OpenNLP 2.0.0 requires Java 11.
> * DL4J 1.0.0-M2.1 - datavec-data-image-1.0.0-M2.1.jar requires Java 11
> * Lucene 9.x -- used in tika-eval
> * icu4j -- we can't upgrade past 62.2 (April 2019) because that is the latest version that is compatible with Lucene 8.11.1 (https://github.com/apache/tika/pull/587)
> * mime4j -- the last 2 (or three?) releases have been accidentally built with Java 9 without the correct release=8. This should be fixed in the next release.
> * Fakeload
> * [checkstyle|https://mail.google.com/mail/u/0/#label/lists%2Ftika/WhctKKXXHvjnJRRdBSwLbKkDkXQtRnWGDhblVMQQZhjsDGrFpRMRQJJrZSdskrNCqcmTtjL]
> * errorprone requires Java 11 for the build (doesn't mean we can't target 8)

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-3175) Upgrade version of TPS: commons-io
[ https://issues.apache.org/jira/browse/TIKA-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov closed TIKA-3175. --- > Upgrade version of TPS: commons-io > -- > > Key: TIKA-3175 > URL: https://issues.apache.org/jira/browse/TIKA-3175 > Project: Tika > Issue Type: Bug >Affects Versions: 1.23, 1.24, 1.24.1 >Reporter: Shubhangi Raut >Priority: Critical > > Latest tika-bundle jars use commons-io-1.26.jar in them. > There is a vulnerability reported for commons-io-2.6.jar which is fixed in > version 2.7. > Details can be found in the following link: > Project: https://issues.apache.org/jira/browse/IO-559 > > Please upgrade the version for commons-io to 2.7 in next release. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3175) Upgrade version of TPS: commons-io
[ https://issues.apache.org/jira/browse/TIKA-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-3175. - Resolution: Duplicate > Upgrade version of TPS: commons-io > -- > > Key: TIKA-3175 > URL: https://issues.apache.org/jira/browse/TIKA-3175 > Project: Tika > Issue Type: Bug >Affects Versions: 1.23, 1.24, 1.24.1 >Reporter: Shubhangi Raut >Priority: Critical > > Latest tika-bundle jars use commons-io-1.26.jar in them. > There is a vulnerability reported for commons-io-2.6.jar which is fixed in > version 2.7. > Details can be found in the following link: > Project: https://issues.apache.org/jira/browse/IO-559 > > Please upgrade the version for commons-io to 2.7 in next release. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3387) Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser
[ https://issues.apache.org/jira/browse/TIKA-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov resolved TIKA-3387.
-------------------------------------
Resolution: Incomplete

Please feel free to reopen the issue if it can be reproduced with a more recent Tika version (2.6.0 at the moment) and you can provide a bit more info.

> Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser
> -----------------------------------------------------------------------------------
>
> Key: TIKA-3387
> URL: https://issues.apache.org/jira/browse/TIKA-3387
> Project: Tika
> Issue Type: Bug
> Components: parser
> Environment: dev testing
> Reporter: Manojkumar M
> Priority: Critical
>
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@7b6359a0
>
> This is the only exception trace we are getting in the code.
>
> This is what is put in the pom.xml:
>
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-core</artifactId>
> </dependency>
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-parsers</artifactId>
>   <exclusions>
>     <exclusion>
>       <groupId>com.fasterxml.jackson.core</groupId>
>       <artifactId>jackson-core</artifactId>
>     </exclusion>
>     <exclusion>
>       <groupId>com.fasterxml.jackson.core</groupId>
>       <artifactId>jackson-annotations</artifactId>
>     </exclusion>
>   </exclusions>
> </dependency>
>
> *Version*
> tika-parsers: 1.24.1
> poi-ooxml: 4.1.2

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3712) update jackson-databind to 2.13.2.1 or greater in tika jars
[ https://issues.apache.org/jira/browse/TIKA-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-3712. - Resolution: Fixed > update jackson-databind to 2.13.2.1 or greater in tika jars > --- > > Key: TIKA-3712 > URL: https://issues.apache.org/jira/browse/TIKA-3712 > Project: Tika > Issue Type: Bug > Components: tika-eval >Affects Versions: 2.3.0 >Reporter: Dhoka Pramod >Priority: Critical > Fix For: 2.4.1 > > > [com.fasterxml.jackson.core_jackson-databind_2.13.1|https://austsbldci-res.lab.opentext.com/static-files/FKgXaaJSguhZ4lO6UfpswhoSmhYTiF2UyQU-rrbduGUxNjQ4NzM4OTIzNDgzOjg6aHNjaGVpYm46dmlldy9UZWFtU2l0ZS9qb2IvRG9ja2VySW1hZ2UtVFMyMi4yL2xhc3RTdWNjZXNzZnVsQnVpbGQvYXJ0aWZhY3Q=/twistlock-report.html#sha256:55f19c5712346e29554e65473ac7c1ef988a2ae2fe1ffa71035426183d4ad4e9_com.fasterxml.jackson.core_jackson-databind_2.13.1] > in tika eval app is of version 2.13.1 which has > [CVE-2020-36518|https://nvd.nist.gov/vuln/detail/CVE-2020-36518] > vulnerability. > jackson databind jars needs to be updated to *2.13.2.1 or greater.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3712) update jackson-databind to 2.13.2.1 or greater in tika jars
[ https://issues.apache.org/jira/browse/TIKA-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3712: Fix Version/s: 2.4.1 > update jackson-databind to 2.13.2.1 or greater in tika jars > --- > > Key: TIKA-3712 > URL: https://issues.apache.org/jira/browse/TIKA-3712 > Project: Tika > Issue Type: Bug > Components: tika-eval >Affects Versions: 2.3.0 >Reporter: Dhoka Pramod >Priority: Critical > Fix For: 2.4.1 > > > [com.fasterxml.jackson.core_jackson-databind_2.13.1|https://austsbldci-res.lab.opentext.com/static-files/FKgXaaJSguhZ4lO6UfpswhoSmhYTiF2UyQU-rrbduGUxNjQ4NzM4OTIzNDgzOjg6aHNjaGVpYm46dmlldy9UZWFtU2l0ZS9qb2IvRG9ja2VySW1hZ2UtVFMyMi4yL2xhc3RTdWNjZXNzZnVsQnVpbGQvYXJ0aWZhY3Q=/twistlock-report.html#sha256:55f19c5712346e29554e65473ac7c1ef988a2ae2fe1ffa71035426183d4ad4e9_com.fasterxml.jackson.core_jackson-databind_2.13.1] > in tika eval app is of version 2.13.1 which has > [CVE-2020-36518|https://nvd.nist.gov/vuln/detail/CVE-2020-36518] > vulnerability. > jackson databind jars needs to be updated to *2.13.2.1 or greater.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3935) Remove log4j 1.2.x from dependencies
Konstantin Gribov created TIKA-3935: --- Summary: Remove log4j 1.2.x from dependencies Key: TIKA-3935 URL: https://issues.apache.org/jira/browse/TIKA-3935 Project: Tika Issue Type: Task Components: depedency Affects Versions: 2.6.0 Reporter: Konstantin Gribov Assignee: Konstantin Gribov Fix For: 2.6.1 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3324) Add checkstyle checker
[ https://issues.apache.org/jira/browse/TIKA-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636228#comment-17636228 ]

Konstantin Gribov commented on TIKA-3324:
-----------------------------------------

I certainly lost the fight against the checkstyle plugin. When I just run {{mvn checkstyle:checkstyle}} it fails on {{tika-core}} with something like 5.7k errors.

What do you think about using [spotless|https://github.com/diffplug/spotless]? It supports a [ratchet|https://github.com/diffplug/spotless/tree/main/plugin-gradle#ratchet] mode that avoids reformatting all files at once and enforces formatting only on changed files.

I'm going to experiment with it in a separate branch, for POMs at first.

> Add checkstyle checker
> ----------------------
>
> Key: TIKA-3324
> URL: https://issues.apache.org/jira/browse/TIKA-3324
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Major
>
> I _think_ we can introduce this gently at first. And slowly fix files as time allows. Obv, we can hope a bulk fix will work, and it won't be much effort... WDYT?
>
> H/T [~ndipiazza_gmail] for the recommendation.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3934) Reogranize POMs parent chain to avoid leaking dependency management downstream
[ https://issues.apache.org/jira/browse/TIKA-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3934: Fix Version/s: 2.6.1 > Reogranize POMs parent chain to avoid leaking dependency management downstream > -- > > Key: TIKA-3934 > URL: https://issues.apache.org/jira/browse/TIKA-3934 > Project: Tika > Issue Type: Improvement > Components: depedency >Affects Versions: 2.6.0 >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.6.1, 2.7.0 > > > Tika's BOM (Bill of Materials) artifact has {{tika-parent}} as a parent POM > and thus forces a lot of dependency versions on downstream users. > For example if one use only PDF module there's no reason to force > Netty/Jetty/CXF/whatever versions. > I propose the following: > * make {{tika}} reactor depend on {{tika-parent}} and all other {{tika-*}} > modules on the reactor > * move all our dependency management and build related configuration to the > reactor ({{tika}} root project) > I've started these work last week and will publish first PR for review soon. > Moving parts from {{tika-parent}} to {{tika}} may take some time so little > steps without build disruption is a must -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3934) Reogranize POMs parent chain to avoid leaking dependency management downstream
[ https://issues.apache.org/jira/browse/TIKA-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3934: Description: Tika's BOM (Bill of Materials) artifact has {{tika-parent}} as a parent POM and thus forces a lot of dependency versions on downstream users. For example if one use only PDF module there's no reason to force Netty/Jetty/CXF/whatever versions. I propose the following: * make {{tika}} reactor depend on {{tika-parent}} and all other {{tika-*}} modules on the reactor * move all our dependency management and build related configuration to the reactor ({{tika}} root project) I've started these work last week and will publish first PR for review soon. Moving parts from {{tika-parent}} to {{tika}} may take some time so little steps without build disruption is a must was: Tika's BOM (Bill of Materials) artifact has {{tika-parent}} as a parent POM and thus forces a lot of dependency versions on downstream users. For example if one use only PDF module there's no reason to force Netty/Jetty/CXF/whatever versions. I propose the following: * move all our dependency management and build related configuration to the reactor ({{tika}} root project) * make {{tika}} rector depend on {{tika-parent}} and all other {{tika-*}} modules on the reactor I've started these work last week and will publish first PR for review soon. Moving parts from {{tika-parent}} to {{tika}} may take some time so little steps without build disruption is a must > Reogranize POMs parent chain to avoid leaking dependency management downstream > -- > > Key: TIKA-3934 > URL: https://issues.apache.org/jira/browse/TIKA-3934 > Project: Tika > Issue Type: Improvement > Components: depedency >Affects Versions: 2.6.0 >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.7.0 > > > Tika's BOM (Bill of Materials) artifact has {{tika-parent}} as a parent POM > and thus forces a lot of dependency versions on downstream users. 
> For example if one use only PDF module there's no reason to force > Netty/Jetty/CXF/whatever versions. > I propose the following: > * make {{tika}} reactor depend on {{tika-parent}} and all other {{tika-*}} > modules on the reactor > * move all our dependency management and build related configuration to the > reactor ({{tika}} root project) > I've started these work last week and will publish first PR for review soon. > Moving parts from {{tika-parent}} to {{tika}} may take some time so little > steps without build disruption is a must -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3934) Reogranize POMs parent chain to avoid leaking dependency management downstream
Konstantin Gribov created TIKA-3934:
---------------------------------------

Summary: Reogranize POMs parent chain to avoid leaking dependency management downstream
Key: TIKA-3934
URL: https://issues.apache.org/jira/browse/TIKA-3934
Project: Tika
Issue Type: Improvement
Components: depedency
Affects Versions: 2.6.0
Reporter: Konstantin Gribov
Assignee: Konstantin Gribov
Fix For: 2.7.0

Tika's BOM (Bill of Materials) artifact has {{tika-parent}} as a parent POM and thus forces a lot of dependency versions on downstream users.

For example, if one uses only the PDF module there's no reason to force Netty/Jetty/CXF/whatever versions.

I propose the following:
* move all our dependency management and build-related configuration to the reactor ({{tika}} root project)
* make the {{tika}} reactor depend on {{tika-parent}} and all other {{tika-*}} modules on the reactor

I started this work last week and will publish a first PR for review soon.

Moving parts from {{tika-parent}} to {{tika}} may take some time, so small steps without build disruption are a must.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)
[ https://issues.apache.org/jira/browse/TIKA-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov closed TIKA-3368. --- > Add Bill of Materials (BOM) artifact (Tika 1.x) > --- > > Key: TIKA-3368 > URL: https://issues.apache.org/jira/browse/TIKA-3368 > Project: Tika > Issue Type: Improvement > Components: packaging >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 1.27 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)
[ https://issues.apache.org/jira/browse/TIKA-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-3368. - Resolution: Invalid Tika 1.x reached EOL and PR was closed some time ago, just a JIRA cleanup > Add Bill of Materials (BOM) artifact (Tika 1.x) > --- > > Key: TIKA-3368 > URL: https://issues.apache.org/jira/browse/TIKA-3368 > Project: Tika > Issue Type: Improvement > Components: packaging >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 1.27 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)
[ https://issues.apache.org/jira/browse/TIKA-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3368: Fix Version/s: 1.27 (was: 2.0.0-BETA) > Add Bill of Materials (BOM) artifact (Tika 1.x) > --- > > Key: TIKA-3368 > URL: https://issues.apache.org/jira/browse/TIKA-3368 > Project: Tika > Issue Type: Improvement > Components: packaging >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 1.27 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3367) Add Bill of Materials (BOM) artifact
[ https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3367: Fix Version/s: 2.3.0 (was: 2.1.0) > Add Bill of Materials (BOM) artifact > > > Key: TIKA-3367 > URL: https://issues.apache.org/jira/browse/TIKA-3367 > Project: Tika > Issue Type: Improvement > Components: packaging >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3367) Add Bill of Materials (BOM) artifact
[ https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-3367. - Resolution: Fixed > Add Bill of Materials (BOM) artifact > > > Key: TIKA-3367 > URL: https://issues.apache.org/jira/browse/TIKA-3367 > Project: Tika > Issue Type: Improvement > Components: packaging >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-3367) Add Bill of Materials (BOM) artifact
[ https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov closed TIKA-3367. --- > Add Bill of Materials (BOM) artifact > > > Key: TIKA-3367 > URL: https://issues.apache.org/jira/browse/TIKA-3367 > Project: Tika > Issue Type: Improvement > Components: packaging >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3493) dcterms:created date depends on the current TimeZone in RTF documents
[ https://issues.apache.org/jira/browse/TIKA-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635951#comment-17635951 ]

Konstantin Gribov commented on TIKA-3493:
-----------------------------------------

Just hit the same issue with one of the tests failing.

I looked through the RTF 1.9 spec, and dates there are effectively local date/time values (just wall-clock time without a time zone). Right now they are interpreted as date/time in the current JVM time zone. Both LibreOffice and Word (on Mac) interpret them the same way.

Maybe we should keep the value without a time zone in the metadata string (in {{dcterms:created}} or another property) and only reinterpret it with a TZ in {{Metadata#getDate}}, but that would be a breaking change. Or we could keep the raw representation plus Tika's best guess at the instant it meant. That is likely to require breaking changes too.

> dcterms:created date depends on the current TimeZone in RTF documents
> ---------------------------------------------------------------------
>
> Key: TIKA-3493
> URL: https://issues.apache.org/jira/browse/TIKA-3493
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.0.0
> Reporter: David Pilato
> Assignee: Tim Allison
> Priority: Minor
> Attachments: Test_case_to_demo_the_change_with_Tika_1_x1.patch
>
> I'm migrating an existing project to Tika 2.0.0.
> I'm seeing a strange behavior.
> TL;DR: the created date of the document changes depending on the timezone.
> Long story:
> I have a unit test which extracts content and metadata from a [RTF document|https://github.com/dadoonet/fscrawler/raw/master/test-documents/src/main/resources/documents/test.rtf].
> When using Tika 1.27, whatever the timezone defined for my JVM, I'm always getting the same value for "dcterms:created": "2016-07-07T13:38:00Z".
> When running the same test with Tika 2.0.0, the date changes depending on the Timezone.
> For example:
> * Asia/Sakhalin gives dcterms:created=2016-07-06T23:38:00Z
> * Asia/Colombo gives dcterms:created=2016-07-07T05:08:00Z
> * Europe/Stockholm gives dcterms:created=2016-07-07T08:38:00Z
>
> I don't know if it's a bug or expected. Maybe the RTF format does not specify the Timezone.
> I'm surprised that I don't see the same behavior for Office documents actually.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
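The behaviour described in the comment above, a zone-less wall-clock value being interpreted in whatever time zone the JVM happens to use, can be sketched with plain {{java.time}}. This is not Tika's parser code; the class and method names are hypothetical. The wall-clock value 2016-07-07 10:38 is an assumption, inferred by working backwards from the three results quoted in the report, not a value read from the RTF file.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Illustrative sketch only; not part of Tika.
class RtfCreatedDateSketch {

    /** Interprets a wall-clock value that carries no zone as if it were in the given zone. */
    static Instant interpret(LocalDateTime wallClock, String zoneId) {
        return wallClock.atZone(ZoneId.of(zoneId)).toInstant();
    }

    public static void main(String[] args) {
        // Assumed wall-clock creation time (inferred from the reported outputs).
        LocalDateTime created = LocalDateTime.of(2016, 7, 7, 10, 38);

        String[] zones = {"Asia/Sakhalin", "Asia/Colombo", "Europe/Stockholm"};
        for (String zone : zones) {
            // The same wall-clock value maps to a different instant in each zone.
            System.out.println(zone + " -> " + interpret(created, zone));
        }
    }
}
```

With up-to-date time zone data the three zones produce 2016-07-06T23:38:00Z, 2016-07-07T05:08:00Z and 2016-07-07T08:38:00Z, the values listed in the issue description. Keeping the value as a {{LocalDateTime}} until a caller supplies a zone is one way to express the "keep it without timezone" option discussed in the comment.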
[jira] [Commented] (TIKA-3906) Build a new version of the Tika docker image to fix CVEs
[ https://issues.apache.org/jira/browse/TIKA-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17625204#comment-17625204 ] Konstantin Gribov commented on TIKA-3906: - +1 on such versioning scheme, it should be transparent enough for the downstream users > Build a new version of the Tika docker image to fix CVEs > > > Key: TIKA-3906 > URL: https://issues.apache.org/jira/browse/TIKA-3906 > Project: Tika > Issue Type: Bug > Components: docker >Affects Versions: 2.5.0 >Reporter: Felix Sperling >Priority: Major > > Please rebuild and release a new version of the 2.5.0 docker image. > The current one contains CVEs which have fixes already in the jammy repos. > h2. zlib > *_Note:_* _Versions mentioned in the description apply to the upstream > {{zlib}} package._ _See {{How to fix?}} for {{Ubuntu:22.04}} relevant > versions._ > zlib through 1.2.12 has a heap-based buffer over-read or buffer overflow in > inflate in inflate.c via a large gzip header extra field. NOTE: only > applications that call inflateGetHeader are affected. Some common > applications bundle the affected zlib source code but may be unable to call > inflateGetHeader (e.g., see the nodejs/node reference). > h2. Remediation > Upgrade {{Ubuntu:22.04}} {{zlib}} to version 1:1.2.11.dfsg-2ubuntu9.2 or > higher. > > h2. perl > *_Note:_* _Versions mentioned in the description apply to the upstream > {{perl}} package._ _See {{How to fix?}} for {{Ubuntu:22.04}} relevant > versions._ > CPAN 2.28 allows Signature Verification Bypass. > h2. Remediation > Upgrade {{Ubuntu:22.04}} {{perl}} to version 5.34.0-3ubuntu1.1 or higher. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3666) Detect and indicate file encrypted with Rights Management Service RMS/IRM
[ https://issues.apache.org/jira/browse/TIKA-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17484950#comment-17484950 ]

Konstantin Gribov commented on TIKA-3666:
-----------------------------------------

When I looked into MS AD RMS some time ago, it unfortunately wasn't supported in Apache POI. AFAIK POI 5.2.0 still doesn't support it.

I'm not sure whether support should be added there first or whether some support could be added to Tika. Either way, test files are a must-have.

> Detect and indicate file encrypted with Rights Management Service RMS/IRM
> -------------------------------------------------------------------------
>
>                 Key: TIKA-3666
>                 URL: https://issues.apache.org/jira/browse/TIKA-3666
>             Project: Tika
>          Issue Type: Improvement
>          Components: metadata
>            Reporter: August Valera
>            Priority: Major
>
> Rights Management Service (RMS), implemented in MS Office as Information
> Rights Management (IRM), allows organizations to set file permissions that
> are stored within the file. In most cases, this will result in the file
> getting a new extension (with a prefix p, such as {{.txt}} becoming
> {{.ptxt}}), but in the case of MS Office and PDF files, which support
> this natively, the implementation results in the file contents being
> encrypted without any extension change.
> h4. Current behavior
> Running such files through Tika produces the same results as running an
> empty file through {{DefaultParser}} and {{OfficeParser}}.
> h4. Expected behavior
> Extract more metadata about the permissions needed to view the file (if
> possible), and throw {{EncryptedDocumentException}}, as is the case with
> Office files encrypted in the more traditional manner.
> Reference:
> [https://docs.microsoft.com/en-us/azure/information-protection/rms-client/clientv2-admin-guide-file-types#supported-file-types-for-classification-and-protection]

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (TIKA-3631) Upgrade log4j 2 to version 2.17.0 in tika
[ https://issues.apache.org/jira/browse/TIKA-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov resolved TIKA-3631.
-------------------------------------
    Resolution: Fixed

> Upgrade log4j 2 to version 2.17.0 in tika
> -----------------------------------------
>
>                 Key: TIKA-3631
>                 URL: https://issues.apache.org/jira/browse/TIKA-3631
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-server
>    Affects Versions: 2.2.0
>            Reporter: Dhoka Pramod
>            Priority: Critical
>             Fix For: 2.2.1
>
> Tika 2.2.0 is still using log4j 2.15, which has a few vulnerabilities. Hence
> log4j in Tika needs to be updated to 2.17.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3632) Log4j appears to be running in a Servlet environment, but there's no log4j-web module available
[ https://issues.apache.org/jira/browse/TIKA-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463513#comment-17463513 ]

Konstantin Gribov commented on TIKA-3632:
-----------------------------------------

I'll look into it. At first glance, it seems it should be added.

> Log4j appears to be running in a Servlet environment, but there's no log4j-web module available
> -----------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3632
>                 URL: https://issues.apache.org/jira/browse/TIKA-3632
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 2.2.0
>         Environment: Windows 10
>            Reporter: Josh Burchard
>            Assignee: Konstantin Gribov
>            Priority: Minor
>
> I noticed the following issue when running the Tika server jar and trying to
> troubleshoot log4j2 (with -Dlog4j2.debug set in the JVM):
> {{INFO StatusLogger Log4j appears to be running in a Servlet environment, but
> there's no log4j-web module available. If you want better web container
> support, please add the log4j-web JAR to your web archive or server lib
> directory.}}
> Is this something that needs to be added when the server jar is built? It's
> not _obviously_ impacting me right now, but since it's a bit noisy (it prints
> eight times) I attempted to quash the noise by downloading
> log4j-web-2.17.0.jar and adding it to my classpath. Unfortunately that did
> nothing.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (TIKA-3632) Log4j appears to be running in a Servlet environment, but there's no log4j-web module available
[ https://issues.apache.org/jira/browse/TIKA-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov reassigned TIKA-3632:
---------------------------------------
    Assignee: Konstantin Gribov

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3628) Is tika 2.2 available
[ https://issues.apache.org/jira/browse/TIKA-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462557#comment-17462557 ]

Konstantin Gribov commented on TIKA-3628:
-----------------------------------------

Great!

For Gradle-related help beyond the docs, I highly recommend the #community-support channel in the [Gradle Community Slack|https://gradle-community.slack.com/].

> Is tika 2.2 available
> ---------------------
>
>                 Key: TIKA-3628
>                 URL: https://issues.apache.org/jira/browse/TIKA-3628
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Vamsi Molli
>            Priority: Major
>
> As per [https://tika.apache.org/], Tika has released version 2.2.
> When trying to upgrade from 2.1.0 to 2.2, I get the following error:
> Could not resolve org.apache.tika:tika-core:2.2.0.
> [group: 'org.apache.tika', name: 'tika-core', version: '2.2.0'],
> [group: 'org.apache.tika', name: 'tika-parsers-standard-package', version: '2.1.0'],
> [group: 'org.apache.tika', name: 'tika-parser-microsoft-module', version: '2.1.0'],
> [group: 'org.apache.tika', name: 'tika-parser-sqlite3-package', version: '2.1.0'],
> [group: 'org.apache.tika', name: 'tika-parser-scientific-module', version: '2.1.0'],
> [group: 'org.apache.tika', name: 'tika-parser-zip-commons', version: '2.1.0'],
> I see that only tika-core has been upgraded to 2.2.0; the rest still show
> 2.1.0 as per (https://mvnrepository.com/artifact/org.apache.tika).

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3628) Is tika 2.2 available
[ https://issues.apache.org/jira/browse/TIKA-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462541#comment-17462541 ]

Konstantin Gribov commented on TIKA-3628:
-----------------------------------------

{quote}No cached version of org.apache.tika:tika-core:2.2.0 available for offline mode{quote}
shows what the problem is. You have offline mode on, which lets Gradle use only the dependencies already in the local Gradle cache. Remove {{--offline}} when running Gradle (or uncheck offline mode in your IDE if you see the issue there).

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3628) Is tika 2.2 available
[ https://issues.apache.org/jira/browse/TIKA-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462491#comment-17462491 ]

Konstantin Gribov commented on TIKA-3628:
-----------------------------------------

Yeah, I get that you want to upgrade. 2.2.0 is available from Central (which is the primary Maven repo).

How you upgrade depends on your build system. Since you didn't specify which one you use, I can only give generic advice: change the relevant version numbers from 2.1.0 to 2.2.0 in your build definition (like pom.xml, build.gradle[.kts], *.project.clj or something else).

Excluding httpcomponents also depends on your build system, but most likely you would want to just select a different version. Check your build system's documentation for how to do this.

> Is tika 2.2 available
> ---------------------
>
>                 Key: TIKA-3628
>                 URL: https://issues.apache.org/jira/browse/TIKA-3628
>             Project: Tika
>          Issue Type: New Feature
>          Components: build
>    Affects Versions: 2.2.0
>            Reporter: Vamsi Molli
>            Priority: Major
>             Fix For: 2.1.0

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3628) Is tika 2.2 available
[ https://issues.apache.org/jira/browse/TIKA-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462482#comment-17462482 ]

Konstantin Gribov commented on TIKA-3628:
-----------------------------------------

Maven Central has Tika 2.2.0: https://search.maven.org/search?q=g:org.apache.tika.

I see that mvnrepository shows a mix of 2.2.0 and 2.1.0 as the latest version; I guess it's still syncing from Central.

Which repository do you use?

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (TIKA-3623) Upgrade log4j to 2.16.0
[ https://issues.apache.org/jira/browse/TIKA-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3623: Priority: Blocker (was: Major) > Upgrade log4j to 2.16.0 > --- > > Key: TIKA-3623 > URL: https://issues.apache.org/jira/browse/TIKA-3623 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Blocker > Fix For: 1.28, 2.2.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (TIKA-3623) Upgrade log4j to 2.16.0
[ https://issues.apache.org/jira/browse/TIKA-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3623: Summary: Upgrade log4j to 2.16.0 (was: Upgrade log4j to 2.0.16) > Upgrade log4j to 2.16.0 > --- > > Key: TIKA-3623 > URL: https://issues.apache.org/jira/browse/TIKA-3623 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Fix For: 1.28, 2.2.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (TIKA-3616) Upgrade log4j2 to 2.15.0
[ https://issues.apache.org/jira/browse/TIKA-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3616: Summary: Upgrade log4j2 to 2.15.0 (was: Upgrade log4j2 to 2.0.15) > Upgrade log4j2 to 2.15.0 > > > Key: TIKA-3616 > URL: https://issues.apache.org/jira/browse/TIKA-3616 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Blocker > Fix For: 2.2.0 > > > RCE...might be difficult to trigger in Tika, but why ask for a PoC... > This only affects 2.x. We were still using the old log4j in 1.x -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3616) Upgrade log4j2
[ https://issues.apache.org/jira/browse/TIKA-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460168#comment-17460168 ]

Konstantin Gribov commented on TIKA-3616:
-----------------------------------------

I looked briefly at how Tika and its upstream dependencies use {{MDC}}/{{ThreadContext}}, which are vulnerable in 2.15. Tika and its dependencies use them quite sparingly (as far as IntelliJ IDEA can find usages).

{{solrj}} puts the Solr client URL into the MDC, ZooKeeper puts the node id from its config file into the MDC, and UIMA puts some ids into it, which don't seem to be user-generated, at least in Tika.

{{testcontainers}} also uses the MDC, but only in {{test}} scope.

> Upgrade log4j2
> --------------
>
>                 Key: TIKA-3616
>                 URL: https://issues.apache.org/jira/browse/TIKA-3616
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>             Fix For: 2.1.1
>
> RCE...might be difficult to trigger in Tika, but why ask for a PoC...
> This only affects 2.x. We were still using the old log4j in 1.x

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (TIKA-3367) Add Bill of Materials (BOM) artifact
[ https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3367: Fix Version/s: (was: 2.0.0-BETA) 2.0.1 > Add Bill of Materials (BOM) artifact > > > Key: TIKA-3367 > URL: https://issues.apache.org/jira/browse/TIKA-3367 > Project: Tika > Issue Type: Improvement > Components: packaging >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.0.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3164: Issue Type: Task (was: Bug) > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3312) Support Log4j2 jar in Tika-app.jar
[ https://issues.apache.org/jira/browse/TIKA-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331234#comment-17331234 ]

Konstantin Gribov commented on TIKA-3312:
-----------------------------------------

[~tallison], will do. I took a peek at it just now and found a couple of things that I'd like to change in the dependencies, but it will require a thoughtful and attentive approach so as not to break anything ;)

> Support Log4j2 jar in Tika-app.jar
> ----------------------------------
>
>                 Key: TIKA-3312
>                 URL: https://issues.apache.org/jira/browse/TIKA-3312
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 1.22, 1.24.1
>            Reporter: Charushila Nanekar
>            Priority: Critical
>
> The latest version of Tika-app uses an older version of the Log4j jar, which
> causes an issue when Tika-app is integrated with other third-party
> applications that use the latest Log4j 2 jar.
> Additionally, Apache Log4j 2 is an upgrade to Log4j that provides significant
> improvements over its predecessor.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux
[ https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov closed TIKA-3149.
-----------------------------------

> Tikka 1.18 not working with tess4j 3.4.8 on linux
> -------------------------------------------------
>
>                 Key: TIKA-3149
>                 URL: https://issues.apache.org/jira/browse/TIKA-3149
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.18
>         Environment: Linux, deployed on WebLogic
>            Reporter: Vishakha
>            Assignee: Konstantin Gribov
>            Priority: Blocker
>              Labels: starter
>
> I am using Tika 1.18 to parse document content. It works on its own when
> deployed on Linux, but it does not work if Tesseract is used before it: it
> gives the error below in parseToString.
> code :
> Tika tika = new Tika();
> try (InputStream stream = new FileInputStream(Paths.get(documentPath.concat(documentName)).toAbsolutePath().toString())) {
>     String documentExt = tika.detect(Paths.get(documentPath.concat(documentName)).toAbsolutePath().toString());
>     String outputStr = tika.parseToString(stream);
>     String tempStr = outputStr.replace("\n", "");
>     _Logger.info("tempStr: " + tempStr);
> } catch (TikaException e) {
>     // TODO Auto-generated catch block
>     _Logger.error("Error :", e);
> }
> Error as :
> java.lang.StackOverflowError
>     at org.slf4j.impl.JDK14LoggerAdapter.fillCallerData(JDK14LoggerAdapter.java:602)
>     at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:587)
>     at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
>     at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
>     at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
>     at java.util.logging.Logger.log(Logger.java:738)
>     at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
>     at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
>     at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
>     at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
>     at java.util.logging.Logger.log(Logger.java:738)
>     at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
>     at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
>     at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
>     at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
>     at java.util.logging.Logger.log(Logger.java:738)
>     at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
>     at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
>     at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
>     at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
>     at java.util.logging.Logger.log(Logger.java:738)
>     ...
>
> Kindly let us know the solution.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux
[ https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov updated TIKA-3149:
------------------------------------
    Description: 
I am using Tika 1.18 to parse document content. It works on its own when deployed on Linux, but it does not work if Tesseract is used before it: it gives the error below in parseToString.

code :
Tika tika = new Tika();
try (InputStream stream = new FileInputStream(Paths.get(documentPath.concat(documentName)).toAbsolutePath().toString())) {
    String documentExt = tika.detect(Paths.get(documentPath.concat(documentName)).toAbsolutePath().toString());
    String outputStr = tika.parseToString(stream);
    String tempStr = outputStr.replace("\n", "");
    _Logger.info("tempStr: " + tempStr);
} catch (TikaException e) {
    // TODO Auto-generated catch block
    _Logger.error("Error :", e);
}

Error as :
java.lang.StackOverflowError
    at org.slf4j.impl.JDK14LoggerAdapter.fillCallerData(JDK14LoggerAdapter.java:602)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:587)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
    at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
    at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
    at java.util.logging.Logger.log(Logger.java:738)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
    at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
    at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
    at java.util.logging.Logger.log(Logger.java:738)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
    at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
    at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
    at java.util.logging.Logger.log(Logger.java:738)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
    at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
    at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
    at java.util.logging.Logger.log(Logger.java:738)
    ...

> kindly let us know the solution
[jira] [Resolved] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux
[ https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov resolved TIKA-3149.
-------------------------------------
      Assignee: Konstantin Gribov
    Resolution: Not A Bug

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux
[ https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331122#comment-17331122 ]

Konstantin Gribov commented on TIKA-3149:
-----------------------------------------

You have both slf4j-jdk14 (a logger implementation that writes to java.util.logging) and jul-to-slf4j (a bridge that redirects java.util.logging to slf4j-api) on the classpath, so every log call bounces between the two until the stack overflows.

I recommend dropping slf4j-jdk14 from the classpath and using another logging implementation (logback-classic, log4j2).

> Tikka 1.18 not working with tess4j 3.4.8 on linux
> -------------------------------------------------
>
>                 Key: TIKA-3149
>                 URL: https://issues.apache.org/jira/browse/TIKA-3149
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.18
>         Environment: Linux, deployed on WebLogic
>            Reporter: Vishakha
>            Priority: Blocker
>              Labels: starter
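The loop described in the TIKA-3149 comment can be reproduced in miniature with java.util.logging alone. This is a minimal sketch, not SLF4J code: `BridgeHandler` and `countForwards` are hypothetical stand-ins for the jul-to-slf4j bridge and the slf4j-jdk14 backend, and the forward cap is an artificial guard added only so that the demo terminates instead of overflowing the stack.

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LoggingLoopDemo {

    /**
     * Hypothetical stand-in for the jul-to-slf4j bridge: a JUL handler that
     * forwards each record to a "facade" whose backend writes to JUL again
     * (the role slf4j-jdk14 plays in the report).
     */
    static final class BridgeHandler extends Handler {
        private final Logger backend;
        private final int maxForwards;
        int forwards = 0;

        BridgeHandler(Logger backend, int maxForwards) {
            this.backend = backend;
            this.maxForwards = maxForwards;
        }

        @Override
        public void publish(LogRecord record) {
            forwards++;
            if (forwards >= maxForwards) {
                return; // artificial cap; without it the recursion never ends
            }
            // Logging back into JUL re-enters publish() on the same thread.
            backend.log(record.getLevel(), record.getMessage());
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    /** Logs one message and returns how many times the bridge handled it. */
    static int countForwards(int maxForwards) {
        Logger logger = Logger.getLogger("loop.demo");
        logger.setUseParentHandlers(false); // keep the demo quiet
        logger.setLevel(Level.ALL);
        BridgeHandler bridge = new BridgeHandler(logger, maxForwards);
        logger.addHandler(bridge);
        try {
            logger.info("hello");
        } finally {
            logger.removeHandler(bridge);
        }
        return bridge.forwards;
    }

    public static void main(String[] args) {
        System.out.println("one log call was handled " + countForwards(50) + " times");
    }
}
```

A single `logger.info` call is handled as many times as the cap allows, which mirrors the repeating `SLF4JBridgeHandler.publish` / `JDK14LoggerAdapter.log` frames in the stack trace. Removing one side of the cycle, as recommended in the comment, is what breaks it.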
[jira] [Updated] (TIKA-3369) Flaky Tesseract OCR confirmMultiPageTiffHandling test
[ https://issues.apache.org/jira/browse/TIKA-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3369: Description: Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with {noformat} [ERROR] TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79 Page 2 not found in: http://www.w3.org/1999/xhtml;> Multipage TIFF Example Page 1 Multipage TIFF Example Page?2 {noformat} Note that Tesseract extracts {{Page?2}} instead of {{Page 2}}. was: Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with {noformat} [ERROR] TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79 Page 2 not found in: http://www.w3.org/1999/xhtml;> Multipage TIFF Example Page 1 Multipage TIFF Example Page?2 {noformat} > Flaky Tesseract OCR confirmMultiPageTiffHandling test > - > > Key: TIKA-3369 > URL: https://issues.apache.org/jira/browse/TIKA-3369 > Project: Tika > Issue Type: Test > Components: ocr >Affects Versions: 2.0.0 > Environment: Arch Linux, kernel: 5.11.16-arch1-1 #1 SMP PREEMPT Wed, > 21 Apr 2021 17:22:13 + x86_64 GNU/Linux > OpenJDK 15.0.2.u7-1 > Tesseract 4.1.1-5 with icu 69.1-1, cairo 1.17.4-5, pango 1:1.48.4-1, > tesseract-data-{eng,deu,fra,rus,ukr} 2:4.0.0-1 (other languages not installed) >Reporter: Konstantin Gribov >Priority: Minor > > Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with > {noformat} > [ERROR] > TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79 > Page 2 not found in: > http://www.w3.org/1999/xhtml;> > > > /> > content="org.apache.tika.parser.ocr.TesseractOCRParser" /> > > > Multipage > TIFF > Example > Page 1 > Multipage > TIFF > Example > Page?2 > > > {noformat} > Note that Tesseract extracts {{Page?2}} instead of {{Page 2}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
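One possible reading of the {{Page?2}} output is that Tesseract emitted a non-ASCII space (for example U+00A0) that the log renders as {{?}}. The report does not confirm this. Under that assumption, a test could normalize whitespace before comparing. The helper below is only a sketch: {{OcrTextNormalizer}} and {{normalize}} are hypothetical names and are not part of Tika's {{TikaTest}}.

```java
import java.util.regex.Pattern;

/** Hypothetical test helper (assumption: the odd character is a Unicode space). */
final class OcrTextNormalizer {
    // \s covers ASCII whitespace; \p{Z} adds Unicode separators such as U+00A0.
    private static final Pattern ANY_SPACE = Pattern.compile("[\\s\\p{Z}]+");

    private OcrTextNormalizer() {
    }

    /** Collapses each run of whitespace or Unicode separators into one ASCII space. */
    static String normalize(String ocrText) {
        return ANY_SPACE.matcher(ocrText).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Simulated OCR output with a no-break space between "Page" and "2".
        String extracted = "Multipage TIFF Example Page\u00A02";
        System.out.println(normalize(extracted).contains("Page 2"));
    }
}
```

If the character turns out to be something other than a separator, this normalization would not help and the assertion itself would need to be relaxed instead.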
[jira] [Created] (TIKA-3369) Flaky Tesseract OCR confirmMultiPageTiffHandling test
Konstantin Gribov created TIKA-3369: --- Summary: Flaky Tesseract OCR confirmMultiPageTiffHandling test Key: TIKA-3369 URL: https://issues.apache.org/jira/browse/TIKA-3369 Project: Tika Issue Type: Test Components: ocr Affects Versions: 2.0.0 Environment: Arch Linux, kernel: 5.11.16-arch1-1 #1 SMP PREEMPT Wed, 21 Apr 2021 17:22:13 + x86_64 GNU/Linux OpenJDK 15.0.2.u7-1 Tesseract 4.1.1-5 with icu 69.1-1, cairo 1.17.4-5, pango 1:1.48.4-1, tesseract-data-{eng,deu,fra,rus,ukr} 2:4.0.0-1 (other languages not installed) Reporter: Konstantin Gribov Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with {noformat} [ERROR] TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79 Page 2 not found in: http://www.w3.org/1999/xhtml;> Multipage TIFF Example Page 1 Multipage TIFF Example Page?2 {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)
Konstantin Gribov created TIKA-3368: --- Summary: Add Bill of Materials (BOM) artifact (Tika 1.x) Key: TIKA-3368 URL: https://issues.apache.org/jira/browse/TIKA-3368 Project: Tika Issue Type: Improvement Components: packaging Reporter: Konstantin Gribov Assignee: Konstantin Gribov Fix For: 1.27 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TIKA-3367) Add Bill of Materials (BOM) artifact
[ https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3367: Fix Version/s: (was: 1.27) > Add Bill of Materials (BOM) artifact > > > Key: TIKA-3367 > URL: https://issues.apache.org/jira/browse/TIKA-3367 > Project: Tika > Issue Type: Improvement > Components: packaging >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (TIKA-3367) Add Bill of Materials (BOM) artifact
Konstantin Gribov created TIKA-3367: --- Summary: Add Bill of Materials (BOM) artifact Key: TIKA-3367 URL: https://issues.apache.org/jira/browse/TIKA-3367 Project: Tika Issue Type: Improvement Components: packaging Reporter: Konstantin Gribov Assignee: Konstantin Gribov Fix For: 2.0.0, 1.27 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3312) Support Log4j2 jar in Tika-app.jar
[ https://issues.apache.org/jira/browse/TIKA-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298345#comment-17298345 ] Konstantin Gribov commented on TIKA-3312: - [~tallison], in that case I think it could safely go into tika-server-core, since that is already an end-user runnable application. What do you think about extracting a module with just a bunch of runtime deps and configs for all CLI tools? > Support Log4j2 jar in Tika-app.jar > -- > > Key: TIKA-3312 > URL: https://issues.apache.org/jira/browse/TIKA-3312 > Project: Tika > Issue Type: Improvement >Affects Versions: 1.22, 1.24.1 >Reporter: Charushila Nanekar >Priority: Critical > > The latest version of Tika-app uses an older version of the Log4j jar, which causes an > issue when Tika-app is integrated with other third-party applications that > use the latest Log4j 2 jar. > Additionally, Apache Log4j 2 is an upgrade to Log4j that provides significant > improvements over its predecessor. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3312) Support Log4j2 jar in Tika-app.jar
[ https://issues.apache.org/jira/browse/TIKA-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298099#comment-17298099 ] Konstantin Gribov commented on TIKA-3312: - I agree that we should upgrade to log4j2 or logback-classic as the logging implementation. But I would advise against using tika-app as a library. [~cnanekar], could you tell us why you depend on it instead of tika-parsers/tika-batch etc.? Then you could choose whichever logging impl you prefer, with configuration specific to your app. > Support Log4j2 jar in Tika-app.jar > -- > > Key: TIKA-3312 > URL: https://issues.apache.org/jira/browse/TIKA-3312 > Project: Tika > Issue Type: Improvement >Affects Versions: 1.22, 1.24.1 >Reporter: Charushila Nanekar >Priority: Critical > > The latest version of Tika-app uses an older version of the Log4j jar, which causes an > issue when Tika-app is integrated with other third-party applications that > use the latest Log4j 2 jar. > Additionally, Apache Log4j 2 is an upgrade to Log4j that provides significant > improvements over its predecessor. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3120) Remove whitelist/blacklist terminology
[ https://issues.apache.org/jira/browse/TIKA-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146315#comment-17146315 ] Konstantin Gribov commented on TIKA-3120: - [~tallison], I noticed messages from commits@tika.a.o about this and saw that you use the include/skip pair. Did you settle on one such pair, or did you go with context-dependent terms on a case-by-case basis? If the former, it might be a good idea to add the recommended include/exclude wording to the wiki for future contributors. > Remove whitelist/blacklist terminology > -- > > Key: TIKA-3120 > URL: https://issues.apache.org/jira/browse/TIKA-3120 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Fix For: 1.25 > > > Looks trivial... -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3121) Rename master branch
[ https://issues.apache.org/jira/browse/TIKA-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146290#comment-17146290 ] Konstantin Gribov commented on TIKA-3121: - An alternative is to just use branches like main, branch_1x, branch_2x etc., archive and lock master, and set the new branch as the default HEAD. This way we would have a much smoother transition with a much smaller potential impact. > Rename master branch > > > Key: TIKA-3121 > URL: https://issues.apache.org/jira/browse/TIKA-3121 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I started a discussion on the dev list for this here: > http://mail-archives.us.apache.org/mod_mbox/tika-dev/202006.mbox/%3CCAC1dCwW9FuK%2BkSzokmweeYwLFiED9g0W-43J1TNhMwnv7rdp8g%40mail.gmail.com%3E > One committer would prefer that we not make this change, but seems ok with it. > Recommendations: > * main > * trunk > * development > * stable -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3121) Rename master branch
[ https://issues.apache.org/jira/browse/TIKA-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146288#comment-17146288 ] Konstantin Gribov commented on TIKA-3121: - I didn't vote before and am a bit ambivalent about the change. Despite all Rich's pushing towards renaming, I'm a bit concerned about the real impact on a developer biased community. To me it looks more like a populist decision, but I may be biased by previous hate storms that used D ideas against anyone who doesn't kneel and plead to spare them despite not being in some minority. We will have to go through documentation, wiki, CI configuration etc. to ensure that the new branch name is used, but we can do this only for our own projects. All external developers who include Tika in their build systems or delivery pipelines, or who write articles/books, and who use the master branch would have to do some additional (and sometimes unexpected) work. In an ideal world it would be handled by the usual scripts/configuration maintenance, but a lot of things with low-priority support or without actual maintenance could break. So, I'm basically -0.5, weakly against, because of the potential impact on downstream users and fellow developers. > Rename master branch > > > Key: TIKA-3121 > URL: https://issues.apache.org/jira/browse/TIKA-3121 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I started a discussion on the dev list for this here: > http://mail-archives.us.apache.org/mod_mbox/tika-dev/202006.mbox/%3CCAC1dCwW9FuK%2BkSzokmweeYwLFiED9g0W-43J1TNhMwnv7rdp8g%40mail.gmail.com%3E > One committer would prefer that we not make this change, but seems ok with it. > Recommendations: > * main > * trunk > * development > * stable -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3082) Consider adding an OpenAPI for tika-server
[ https://issues.apache.org/jira/browse/TIKA-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073020#comment-17073020 ] Konstantin Gribov commented on TIKA-3082: - Also, we could later add client modules for a couple of popular libraries to give downstream users ready-to-fly libs with already declared deps. > Consider adding an OpenAPI for tika-server > -- > > Key: TIKA-3082 > URL: https://issues.apache.org/jira/browse/TIKA-3082 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Lewis John McGibbney >Priority: Major > > On TIKA-2253, [~lewismc] asked: > bq. I was planning on putting together an OpenAPI specification for Tika. Is > anyone in favor of this? > What do people think? How much will it change the current tika-server? What > are the benefits? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3082) Consider adding an OpenAPI for tika-server
[ https://issues.apache.org/jira/browse/TIKA-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073017#comment-17073017 ] Konstantin Gribov commented on TIKA-3082: - [~lewismc], my gratitude and a big +1 then) In my experience some OpenAPI/Swagger tools are quite fragile (e.g. swagger-codegen could break on a minor version update), but overall I'm very inclined to use it since it gives us better maintainability, documentation generation and easier API versioning. Also, I'd like to propose moving the current APIs to a versioned namespace {{/api/v1/*}} and redirecting existing methods (like {{/meta}}, {{/rmeta}} etc.) there with HTTP status 301. BTW, JetBrains IDEA has a bundled OpenAPI plugin (at least 2020.1 RC does). > Consider adding an OpenAPI for tika-server > -- > > Key: TIKA-3082 > URL: https://issues.apache.org/jira/browse/TIKA-3082 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Lewis John McGibbney >Priority: Major > > On TIKA-2253, [~lewismc] asked: > bq. I was planning on putting together an OpenAPI specification for Tika. Is > anyone in favor of this? > What do people think? How much will it change the current tika-server? What > are the benefits? -- This message was sent by Atlassian Jira (v8.3.4#803005)
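The redirect proposal in the comment above could look roughly like the sketch below. It only models the path mapping, not the tika-server wiring. {{LegacyPathRedirects}} and {{locationFor}} are hypothetical names, and the set of legacy roots contains just the two endpoints named in the comment.

```java
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of the proposed mapping; not taken from tika-server code. */
final class LegacyPathRedirects {
    static final int MOVED_PERMANENTLY = 301;
    static final String VERSION_PREFIX = "/api/v1";
    // Assumption: only the endpoints mentioned in the comment are treated as legacy.
    private static final Set<String> LEGACY_ROOTS = Set.of("meta", "rmeta");

    private LegacyPathRedirects() {
    }

    /** Returns the Location header value for a legacy path, or empty if no redirect applies. */
    static Optional<String> locationFor(String requestPath) {
        if (requestPath == null || !requestPath.startsWith("/")) {
            return Optional.empty();
        }
        // First path segment, e.g. "rmeta" for "/rmeta/text".
        String firstSegment = requestPath.substring(1).split("/", 2)[0];
        return LEGACY_ROOTS.contains(firstSegment)
                ? Optional.of(VERSION_PREFIX + requestPath)
                : Optional.empty();
    }

    public static void main(String[] args) {
        locationFor("/rmeta/text").ifPresent(
                location -> System.out.println(MOVED_PERMANENTLY + " Location: " + location));
    }
}
```

A server-side filter would call {{locationFor}} on each request and answer with status 301 plus the returned {{Location}} value when it is present; paths already under {{/api/v1}} pass through untouched.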
[jira] [Commented] (TIKA-3082) Consider adding an OpenAPI for tika-server
[ https://issues.apache.org/jira/browse/TIKA-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072957#comment-17072957 ] Konstantin Gribov commented on TIKA-3082: - [~lewismc], could you please clarify how you wish to use the OpenAPI spec? Such a spec could be used to generate client libraries and stubs for JAX-RS, or it could be generated from additional annotations on, say, JAX-RS services. Both solutions are viable, but the choice depends on your goals in introducing OpenAPI. Both have pros and cons, so I hope you'll have some time to expand on your original idea. > Consider adding an OpenAPI for tika-server > -- > > Key: TIKA-3082 > URL: https://issues.apache.org/jira/browse/TIKA-3082 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > On TIKA-2253, [~lewismc] asked: > bq. I was planning on putting together an OpenAPI specification for Tika. Is > anyone in favor of this? > What do people think? How much will it change the current tika-server? What > are the benefits? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3073) Add gzip in- and out- interceptors to tika-server
[ https://issues.apache.org/jira/browse/TIKA-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062662#comment-17062662 ] Konstantin Gribov commented on TIKA-3073: - [~tallison], glad to help. I'm unfamiliar with CXF so here you go. > Add gzip in- and out- interceptors to tika-server > - > > Key: TIKA-3073 > URL: https://issues.apache.org/jira/browse/TIKA-3073 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 1.25 > > > On TIKA-3069, [~carina.antunes] requested compressing /rmeta output. This > makes sense as a start...we might also look into allowing more > configurability around which metadata fields and file types to send back over > the wire. Few people need everything... -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (TIKA-3073) Add compression option to /rmeta output
[ https://issues.apache.org/jira/browse/TIKA-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061672#comment-17061672 ] Konstantin Gribov edited comment on TIKA-3073 at 3/18/20, 12:32 PM: [~tallison], usually a web server should accept the HTTP {{Accept-Encoding: gzip, deflate}} header (you can set it with curl's {{--compressed}}), but I don't know how this should be configured in CXF. It seems tika-server ignores it and just uses {{chunked}}. So, IMHO, this is out of scope for JAX-RS and has more to do with CXF/Jetty. Jetty itself has [https://www.eclipse.org/jetty/documentation/current/gzip-filter.html], which can be enabled for the whole server by adding it with {{org.eclipse.jetty.server.Server#insertHandler}}. Some servers return {{Content-Encoding}} instead of {{Transfer-Encoding}}, and curl supports both. To test, just call {{curl --compressed --http1.1 -v https://code.jquery.com/jquery-3.3.1.slim.min.js}} with and without the {{--compressed}} flag. was (Author: grossws): [~tallison], usually webserver should accept HTTP {{Accept-Encoding: gzip, deflate}} header (you could set it with curl's --compressed), but I don't know how this should be configured in CXF. But it seems tika-server ignores it and just use {{chunked}}. So, IMHO, it's out of scope for JAX-RS but more to do with CXF/Jetty. Jetty itself has [https://www.eclipse.org/jetty/documentation/current/gzip-filter.html] which can be enabled for whole server using by adding it with {{org.eclipse.jetty.server.Server#insertHandler}}. Some servers would return {{Content-Encoding}} instead of {{Transfer-Encoding}} and curl supports both. To test just call {{curl --compressed --http1.1 -v [https://code.jquery.com/jquery-3.3.1.slim.min.js]-}} with and without {{-compressed}} flag. 
> Add compression option to /rmeta output > --- > > Key: TIKA-3073 > URL: https://issues.apache.org/jira/browse/TIKA-3073 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > On TIKA-3069, [~carina.antunes] requested compressing /rmeta output. This > makes sense as a start...we might also look into allowing more > configurability around which metadata fields and file types to send back over > the wire. Few people need everything... -- This message was sent by Atlassian Jira (v8.3.4#803005)
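The negotiation described in the comment above (the client sends {{Accept-Encoding}}, the server answers with a compressed body) can be illustrated with a minimal JDK-only sketch. It is not the CXF or Jetty mechanism and does not show {{Server#insertHandler}}. {{GzipNegotiation}} and its methods are hypothetical names, and q-values in the header are deliberately ignored.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch of gzip content negotiation using only the JDK. */
final class GzipNegotiation {
    private GzipNegotiation() {
    }

    /** True if the Accept-Encoding value lists "gzip"; q-values are ignored in this sketch. */
    static boolean acceptsGzip(String acceptEncoding) {
        if (acceptEncoding == null) {
            return false;
        }
        for (String token : acceptEncoding.split(",")) {
            // Drop any parameters such as ";q=0.8" and compare the coding name only.
            String coding = token.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            if (coding.equals("gzip")) {
                return true;
            }
        }
        return false;
    }

    /** Compresses a response body; a server would also set "Content-Encoding: gzip". */
    static byte[] gzip(byte[] body) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] body = "[{\"Content-Type\":\"text/plain\"}]".getBytes(StandardCharsets.UTF_8);
        byte[] sent = acceptsGzip("gzip, deflate") ? gzip(body) : body;
        System.out.println(body.length + " bytes in, " + sent.length + " bytes out");
    }
}
```

For a payload this small the gzip header overhead can make the output larger than the input; the benefit shows up on large {{/rmeta}} responses.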
[jira] [Commented] (TIKA-3073) Add compression option to /rmeta output
[ https://issues.apache.org/jira/browse/TIKA-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061672#comment-17061672 ] Konstantin Gribov commented on TIKA-3073: - [~tallison], usually a web server should accept the HTTP {{Accept-Encoding: gzip, deflate}} header (you can set it with curl's --compressed), but I don't know how this should be configured in CXF. It seems tika-server ignores it and just uses {{chunked}}. So, IMHO, this is out of scope for JAX-RS and has more to do with CXF/Jetty. Jetty itself has https://www.eclipse.org/jetty/documentation/current/gzip-filter.html, which can be enabled for the whole server by adding it with {{org.eclipse.jetty.server.Server#insertHandler}}. Some servers return {{Content-Encoding}} instead of {{Transfer-Encoding}}, and curl supports both. To test, just call {{curl --compressed --http1.1 -v https://code.jquery.com/jquery-3.3.1.slim.min.js}} with and without the {{--compressed}} flag. > Add compression option to /rmeta output > --- > > Key: TIKA-3073 > URL: https://issues.apache.org/jira/browse/TIKA-3073 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > On TIKA-3069, [~carina.antunes] requested compressing /rmeta output. This > makes sense as a start...we might also look into allowing more > configurability around which metadata fields and file types to send back over > the wire. Few people need everything... -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3019) [9.8] [CVE-2019-17571] [tika-app] [1.23]
[ https://issues.apache.org/jira/browse/TIKA-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016096#comment-17016096 ] Konstantin Gribov commented on TIKA-3019: - [~rgoers], yes, I mentioned {{log4j1.compatibility}} above. It may not be an ideal solution but could work in simple cases. > [9.8] [CVE-2019-17571] [tika-app] [1.23] > > > Key: TIKA-3019 > URL: https://issues.apache.org/jira/browse/TIKA-3019 > Project: Tika > Issue Type: Bug > Components: tika-batch >Affects Versions: 1.23 >Reporter: Aman Mishra >Priority: Major > > *Description :* > *Severity :* Sonatype CVSS 3: 9.8 CVE CVSS 2.0: 0.0 > *Weakness :* Sonatype CWE: 502 > *Source :* National Vulnerability Database > *Categories :* Data > *Description from CVE :* Included in Log4j 1.2 is a SocketServer class that > is vulnerable to deserialization of untrusted data which can be exploited to > remotely execute arbitrary code when combined with a deserialization gadget > when listening to untrusted network traffic for log data. This affects Log4j > versions up to 1.2 up to 1.2.17. > *Explanation :* The log4j:log4j package is vulnerable to Remote Code > Execution [RCE] due to Deserialization of Untrusted Data. The > configureHierarchy and genericHierarchy methods in SocketServer.class do not > verify if the file at a given file path contains any untrusted objects prior > to deserializing them. A remote attacker can exploit this vulnerability by > providing a path to crafted files, which result in arbitrary code execution > when deserialized. > NOTE: Starting with version[s] 2.x, log4j:log4j was relocated to > org.apache.logging.log4j:log4j-core. A variation of this vulnerability exists > in org.apache.logging.log4j:log4j-core as CVE-2017-5645, in versions up to > but excluding 2.8.2. > *Detection :* The application is vulnerable by using this component. > *Recommendation :* Starting with version[s] 2.x, log4j:log4j was relocated to > org.apache.logging.log4j:log4j-core. 
A variation of this vulnerability exists > in org.apache.logging.log4j:log4j-core as CVE-2017-5645, in versions up to > but excluding 2.8.2. Therefore,it is recommended to upgrade to > org.apache.logging.log4j:log4j-core version[s] 2.8.2 and above. For > log4j:log4j 1.x versions however, a fix does not exist. > *Root Cause :* tika-app-1.23.jarorg/apache/log4j/net/SocketServer.class : [,] > *Advisories :* Project: [https://bugzilla.redhat.com/show_bug.cgi?id=1785616] > *CVSS Details :* Sonatype CVSS 3: 9.8CVSS Vector: > CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (TIKA-3018) log4j 1.2 version used by Apache Tika 1.23 is vulnerable to CVE-2019-17571
[ https://issues.apache.org/jira/browse/TIKA-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov closed TIKA-3018. --- > log4j 1.2 version used by Apache Tika 1.23 is vulnerable to CVE-2019-17571 > -- > > Key: TIKA-3018 > URL: https://issues.apache.org/jira/browse/TIKA-3018 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 1.23 >Reporter: Abhijit Rajwade >Priority: Major > > Sonatype Nexus auditor is reporting following log4j related security issue on > Apache Tika 1.23. > Recommendation is to use org.apache.logging.log4j:log4j-core version(s) 2.8.2 > and above. Can you please check if Apache Tika vulnerable and if so upgrade > based on the recommendation? > Description > Description from CVE > Included in Log4j 1.2 is a SocketServer class that is vulnerable to > deserialization of untrusted data which can be exploited to remotely execute > arbitrary code when combined with a deserialization gadget when listening to > untrusted network traffic for log data. This affects Log4j versions up to 1.2 > up to 1.2.17. > Explanation > The log4j:log4j package is vulnerable to Remote Code Execution (RCE) due > to Deserialization of Untrusted Data. The configureHierarchy and > genericHierarchy methods in SocketServer.class do not verify if the file at a > given file path contains any untrusted objects prior to deserializing them. A > remote attacker can exploit this vulnerability by providing a path to crafted > files, which result in arbitrary code execution when deserialized. > NOTE: Starting with version(s) 2.x, log4j:log4j was relocated to > org.apache.logging.log4j:log4j-core. A variation of this vulnerability exists > in org.apache.logging.log4j:log4j-core as CVE-2017-5645, in versions up to > but excluding 2.8.2. > Detection > The application is vulnerable by using this component. > Recommendation > Starting with version(s) 2.x, log4j:log4j was relocated to > org.apache.logging.log4j:log4j-core. 
A variation of this vulnerability exists > in org.apache.logging.log4j:log4j-core as CVE-2017-5645, in versions up to > but excluding 2.8.2. Therefore, it is recommended to upgrade to > org.apache.logging.log4j:log4j-core version(s) 2.8.2 and above. For > log4j:log4j 1.x versions however, a fix does not exist. > Root Cause > tika-app-1.23.jar <= org/apache/log4j/net/SocketServer.class : (,) > Advisories > Project: https://issues.apache.org/jira/browse/LOG4J2-1863 > Project: https://lists.apache.org/thread.html/84cc4266238e057b95eb95d… > Third Party: https://bugzilla.redhat.com/show_bug.cgi?id=1785616 > CVSS Details > Sonatype CVSS 3: 9.8 > CVSS Vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (TIKA-3018) log4j 1.2 version used by Apache Tika 1.23 is vulnerable to CVE-2019-17571
[ https://issues.apache.org/jira/browse/TIKA-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-3018. - Resolution: Duplicate > log4j 1.2 version used by Apache Tika 1.23 is vulnerable to CVE-2019-17571 > -- > > Key: TIKA-3018 > URL: https://issues.apache.org/jira/browse/TIKA-3018 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 1.23 >Reporter: Abhijit Rajwade >Priority: Major > > Sonatype Nexus auditor is reporting following log4j related security issue on > Apache Tika 1.23. > Recommendation is to use org.apache.logging.log4j:log4j-core version(s) 2.8.2 > and above. Can you please check if Apache Tika vulnerable and if so upgrade > based on the recommendation? > Description > Description from CVE > Included in Log4j 1.2 is a SocketServer class that is vulnerable to > deserialization of untrusted data which can be exploited to remotely execute > arbitrary code when combined with a deserialization gadget when listening to > untrusted network traffic for log data. This affects Log4j versions up to 1.2 > up to 1.2.17. > Explanation > The log4j:log4j package is vulnerable to Remote Code Execution (RCE) due > to Deserialization of Untrusted Data. The configureHierarchy and > genericHierarchy methods in SocketServer.class do not verify if the file at a > given file path contains any untrusted objects prior to deserializing them. A > remote attacker can exploit this vulnerability by providing a path to crafted > files, which result in arbitrary code execution when deserialized. > NOTE: Starting with version(s) 2.x, log4j:log4j was relocated to > org.apache.logging.log4j:log4j-core. A variation of this vulnerability exists > in org.apache.logging.log4j:log4j-core as CVE-2017-5645, in versions up to > but excluding 2.8.2. > Detection > The application is vulnerable by using this component. 
> Recommendation > Starting with version(s) 2.x, log4j:log4j was relocated to > org.apache.logging.log4j:log4j-core. A variation of this vulnerability exists > in org.apache.logging.log4j:log4j-core as CVE-2017-5645, in versions up to > but excluding 2.8.2. Therefore, it is recommended to upgrade to > org.apache.logging.log4j:log4j-core version(s) 2.8.2 and above. For > log4j:log4j 1.x versions however, a fix does not exist. > Root Cause > tika-app-1.23.jar <= org/apache/log4j/net/SocketServer.class : (,) > Advisories > Project: https://issues.apache.org/jira/browse/LOG4J2-1863 > Project: https://lists.apache.org/thread.html/84cc4266238e057b95eb95d… > Third Party: https://bugzilla.redhat.com/show_bug.cgi?id=1785616 > CVSS Details > Sonatype CVSS 3: 9.8 > CVSS Vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3019) [9.8] [CVE-2019-17571] [tika-app] [1.23]
[ https://issues.apache.org/jira/browse/TIKA-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013000#comment-17013000 ] Konstantin Gribov commented on TIKA-3019: - [~tallison], there actually seems to be a twofold issue for downstream users who depend on tika-app/server/eval and use log4j 1.2.x: logging backend configuration and direct use of the log4j 1.x API (e.g. LogManager etc.). As I don't use the log4j logging backend, I may be overlooking something. It's unlikely that downstream folks would depend on tika-app/server, so I'd say that if we encounter someone really using it that way, we advise them to update or to use the log4j-1.2-api module (the 1.2.x bridge to the 2.x API). If they don't use the *internal* API it should be ok. See https://logging.apache.org/log4j/2.x/manual/migration.html and https://logging.apache.org/log4j/2.x/manual/compatibility.html. Most likely we will break programmatic configuration in this case (e.g. someone using their own main class with -q/-v parameters). On the configuration side, a downstream user could use the {{log4j1.compatibility}} system property to keep using old configs, but there are some caveats (like a custom appender that depends on some log4j 1.2 implementation). Again, recommending an update, or a downgrade to 1.2.x as [~kkrugler] said, with a clear warning about the CVE, is all we can do here, I guess. Also, it seems this vulnerability in SocketServer only affects those who accept logging events via TCP from other services. I can't imagine such a use for tika-app/server off the top of my head. Most likely we aren't affected by this CVE at all. My vote is for migrating to 2.x and pointing users to the aforementioned migration/compatibility guides. 
> [9.8] [CVE-2019-17571] [tika-app] [1.23]
> ----------------------------------------
>
> Key: TIKA-3019
> URL: https://issues.apache.org/jira/browse/TIKA-3019
> Project: Tika
> Issue Type: Bug
> Components: tika-batch
> Affects Versions: 1.23
> Reporter: Aman Mishra
> Priority: Major
>
> *Description :*
> *Severity :* Sonatype CVSS 3: 9.8; CVE CVSS 2.0: 0.0
> *Weakness :* Sonatype CWE: 502
> *Source :* National Vulnerability Database
> *Categories :* Data
> *Description from CVE :* Included in Log4j 1.2 is a SocketServer class that is vulnerable to deserialization of untrusted data which can be exploited to remotely execute arbitrary code when combined with a deserialization gadget when listening to untrusted network traffic for log data. This affects Log4j versions up to 1.2 up to 1.2.17.
> *Explanation :* The log4j:log4j package is vulnerable to Remote Code Execution [RCE] due to Deserialization of Untrusted Data. The configureHierarchy and genericHierarchy methods in SocketServer.class do not verify if the file at a given file path contains any untrusted objects prior to deserializing them. A remote attacker can exploit this vulnerability by providing a path to crafted files, which result in arbitrary code execution when deserialized.
> NOTE: Starting with version[s] 2.x, log4j:log4j was relocated to org.apache.logging.log4j:log4j-core. A variation of this vulnerability exists in org.apache.logging.log4j:log4j-core as CVE-2017-5645, in versions up to but excluding 2.8.2.
> *Detection :* The application is vulnerable by using this component.
> *Recommendation :* Starting with version[s] 2.x, log4j:log4j was relocated to org.apache.logging.log4j:log4j-core. A variation of this vulnerability exists in org.apache.logging.log4j:log4j-core as CVE-2017-5645, in versions up to but excluding 2.8.2. Therefore, it is recommended to upgrade to org.apache.logging.log4j:log4j-core version[s] 2.8.2 and above. For log4j:log4j 1.x versions however, a fix does not exist.
> *Root Cause :* tika-app-1.23.jar org/apache/log4j/net/SocketServer.class : [,]
> *Advisories :* Project: [https://bugzilla.redhat.com/show_bug.cgi?id=1785616]
> *CVSS Details :* Sonatype CVSS 3: 9.8; CVSS Vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Closed] (TIKA-2601) Invalid XHTML output (overlapping a and formatting tags) for some WORD documents
[ https://issues.apache.org/jira/browse/TIKA-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov closed TIKA-2601. --- > Invalid XHTML output (overlapping a and formatting tags) for some WORD > documents > > > Key: TIKA-2601 > URL: https://issues.apache.org/jira/browse/TIKA-2601 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 > Environment: Linked is a sample document with its corresponding > output. >Reporter: Filip >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.0, 1.21 > > Attachments: Invalid-XML.doc, Test.doc, test.html > > > In some WORD (.doc, .docx) documents the XHTML elements are not closed > properly. This usually happens when there are link elements () as well as > italic or bold elements (). > > Fix should be done in > [https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov closed TIKA-879.
----------------------------------

> Detection problem: message/rfc822 file is detected as text/plain.
> -----------------------------------------------------------------
>
> Key: TIKA-879
> URL: https://issues.apache.org/jira/browse/TIKA-879
> Project: Tika
> Issue Type: Bug
> Components: metadata, mime
> Affects Versions: 1.0, 1.1, 1.2
> Environment: linux 3.2.9
> oracle jdk7, openjdk7, sun jdk6
> Reporter: Konstantin Gribov
> Priority: Major
> Labels: new-parser
> Fix For: 2.0, 1.18
>
> Attachments: TIKA-879-thunderbird.eml, mbox_email_section.txt, mime_diffs_A_to_B.html
>
>
> When using {{DefaultDetector}}, the mime type for {{.eml}} files is different (you can test it on {{testRFC822}} and {{testRFC822_base64}} in {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for this behavior is that only the magic detector really works for such files, even if you set {{CONTENT_TYPE}} in the metadata or an {{.eml}} file name in {{RESOURCE_NAME_KEY}}.
> As I found, {{MediaTypeRegistry.isSpecializationOf("message/rfc822", "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} works only by magic.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Closed] (TIKA-2209) Update PDFBox to 2.0.4
[ https://issues.apache.org/jira/browse/TIKA-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov closed TIKA-2209. --- > Update PDFBox to 2.0.4 > -- > > Key: TIKA-2209 > URL: https://issues.apache.org/jira/browse/TIKA-2209 > Project: Tika > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Trivial > Fix For: 2.0, 1.15 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (TIKA-2681) Upgrade to PDFBox 2.0.11
[ https://issues.apache.org/jira/browse/TIKA-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov closed TIKA-2681. --- > Upgrade to PDFBox 2.0.11 > > > Key: TIKA-2681 > URL: https://issues.apache.org/jira/browse/TIKA-2681 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.18 >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.0, 1.19 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (TIKA-2622) Upgrade to PDFBox 2.0.10 when available
[ https://issues.apache.org/jira/browse/TIKA-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov closed TIKA-2622. --- > Upgrade to PDFBox 2.0.10 when available > --- > > Key: TIKA-2622 > URL: https://issues.apache.org/jira/browse/TIKA-2622 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2566) Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-2566. - Resolution: Fixed > Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in > the rest of Tika > -- > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-2566) Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-2566: Summary: Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in the rest of Tika (was: Move logging in tika-core to log4j via slf4j as we do in the rest of Tika) > Move logging in tika-core to slf4j-api (with log4j in test scope) as we do in > the rest of Tika > -- > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2314) Migrate logging to slf4j in master (2.x) branch
[ https://issues.apache.org/jira/browse/TIKA-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-2314. - Resolution: Resolved > Migrate logging to slf4j in master (2.x) branch > --- > > Key: TIKA-2314 > URL: https://issues.apache.org/jira/browse/TIKA-2314 > Project: Tika > Issue Type: Improvement >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Labels: logging > Fix For: 2.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (TIKA-2315) Update logging page at wiki with actual info
[ https://issues.apache.org/jira/browse/TIKA-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov closed TIKA-2315. --- > Update logging page at wiki with actual info > > > Key: TIKA-2315 > URL: https://issues.apache.org/jira/browse/TIKA-2315 > Project: Tika > Issue Type: Task >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Minor > Labels: logging > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2315) Update logging page at wiki with actual info
[ https://issues.apache.org/jira/browse/TIKA-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-2315. - Resolution: Fixed > Update logging page at wiki with actual info > > > Key: TIKA-2315 > URL: https://issues.apache.org/jira/browse/TIKA-2315 > Project: Tika > Issue Type: Task >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Minor > Labels: logging > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-2314) Migrate logging to slf4j in master (2.x) branch
[ https://issues.apache.org/jira/browse/TIKA-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-2314: Summary: Migrate logging to slf4j in master (2.x) branch (was: Migrate logging to slf4j in 2.x branch) > Migrate logging to slf4j in master (2.x) branch > --- > > Key: TIKA-2314 > URL: https://issues.apache.org/jira/browse/TIKA-2314 > Project: Tika > Issue Type: Improvement >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Labels: logging > Fix For: 2.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-2566: Fix Version/s: (was: 1.20) > Move logging in tika-core to log4j via slf4j as we do in the rest of Tika > - > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-2566. - Resolution: Fixed Fix Version/s: 1.20 > Move logging in tika-core to log4j via slf4j as we do in the rest of Tika > - > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Minor > Fix For: 2.0.0, 1.20 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov reopened TIKA-2566: - > Move logging in tika-core to log4j via slf4j as we do in the rest of Tika > - > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Minor > Fix For: 2.0.0, 1.20 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2555) Text with [underline] + [another format] in word document generates overlapping html tags.
[ https://issues.apache.org/jira/browse/TIKA-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov resolved TIKA-2555.
-------------------------------------
Resolution: Fixed
Fix Version/s: 1.21
               2.0

> Text with [underline] + [another format] in word document generates overlapping html tags.
> ------------------------------------------------------------------------------------------
>
> Key: TIKA-2555
> URL: https://issues.apache.org/jira/browse/TIKA-2555
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.17
> Reporter: Serban Alexe
> Assignee: Konstantin Gribov
> Priority: Minor
> Fix For: 2.0, 1.21
>
> Attachments: Clipboard02.jpg
>
>
> I have a sample _.docx_ document which contains a single line of text.
> Making that text:
> * +underlined+
> ** AND at least one of the following two
> * _italic_
> * *bold*
> will cause the generated _.xhtml_ file to contain overlapping tags.
>
> _+Example+_:
> *+The quick brown fox jumps over the lazy dog.+*
> will result in
> The quick brown fox jumps over the lazy dog.
> which causes some browsers (Firefox, Chrome) to give an error and not display the content of the file...
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
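Tags overlap, as in the report above, when formatting elements are closed in a different order than they were opened. The following is only a minimal sketch of one way to keep such output well-formed, and is not the fix that was applied in Tika: the `TagNester` class and its methods are hypothetical names invented for illustration. It keeps the open tags in a list and, whenever the set of formats changes between text runs, closes the innermost tags first and then reopens whatever is needed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Tika): emits properly nested inline tags. */
class TagNester {
    // Currently open tags, outermost first
    private final List<String> open = new ArrayList<>();
    private final StringBuilder out = new StringBuilder();

    /** Appends a text run wrapped in the given tags, e.g. ["b", "u"]. Text is assumed to be escaped already. */
    TagNester run(List<String> wanted, String text) {
        // Count how many of the open tags can stay open (common prefix)
        int keep = 0;
        while (keep < open.size() && keep < wanted.size()
                && open.get(keep).equals(wanted.get(keep))) {
            keep++;
        }
        // Close everything beyond the common prefix, innermost first
        closeDownTo(keep);
        // Open the tags that are still missing, outermost first
        for (String tag : wanted.subList(keep, wanted.size())) {
            out.append('<').append(tag).append('>');
            open.add(tag);
        }
        out.append(text);
        return this;
    }

    /** Closes all remaining tags and returns the markup. */
    String finish() {
        closeDownTo(0);
        return out.toString();
    }

    private void closeDownTo(int depth) {
        while (open.size() > depth) {
            String tag = open.remove(open.size() - 1);
            out.append("</").append(tag).append('>');
        }
    }
}
```

For example, feeding the runs `["b", "u"]`, then `["b"]`, then `["u"]` closes `u` before the second run and closes `b` before reopening `u` for the third, so no element ends inside another one's scope. The trade-off of this approach is that a format spanning several runs may be split into several elements.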
[jira] [Resolved] (TIKA-2601) Invalid XHTML output (overlapping a and formatting tags) for some WORD documents
[ https://issues.apache.org/jira/browse/TIKA-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-2601. - Resolution: Fixed Fix Version/s: 1.21 2.0 > Invalid XHTML output (overlapping a and formatting tags) for some WORD > documents > > > Key: TIKA-2601 > URL: https://issues.apache.org/jira/browse/TIKA-2601 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 > Environment: Linked is a sample document with its corresponding > output. >Reporter: Filip >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.0, 1.21 > > Attachments: Invalid-XML.doc, Test.doc, test.html > > > In some WORD (.doc, .docx) documents the XHTML elements are not closed > properly. This usually happens when there are link elements () as well as > italic or bold elements (). > > Fix should be done in > [https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov reassigned TIKA-2566: --- Assignee: Konstantin Gribov > Move logging in tika-core to log4j via slf4j as we do in the rest of Tika > - > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Assignee: Konstantin Gribov >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (TIKA-2601) Invalid XHTML output (overlapping a and formatting tags) for some WORD documents
[ https://issues.apache.org/jira/browse/TIKA-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov reopened TIKA-2601: - Assignee: Konstantin Gribov > Invalid XHTML output (overlapping a and formatting tags) for some WORD > documents > > > Key: TIKA-2601 > URL: https://issues.apache.org/jira/browse/TIKA-2601 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 > Environment: Linked is a sample document with its corresponding > output. >Reporter: Filip >Assignee: Konstantin Gribov >Priority: Major > Attachments: Invalid-XML.doc, Test.doc, test.html > > > In some WORD (.doc, .docx) documents the XHTML elements are not closed > properly. This usually happens when there are link elements () as well as > italic or bold elements (). > > Fix should be done in > [https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (TIKA-2601) Invalid XHTML output (overlapping a and formatting tags) for some WORD documents
[ https://issues.apache.org/jira/browse/TIKA-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-2601: Summary: Invalid XHTML output (overlapping a and formatting tags) for some WORD documents (was: Invalid XHTML output for some WORD documents) > Invalid XHTML output (overlapping a and formatting tags) for some WORD > documents > > > Key: TIKA-2601 > URL: https://issues.apache.org/jira/browse/TIKA-2601 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 > Environment: Linked is a sample document with its corresponding > output. >Reporter: Filip >Priority: Major > Attachments: Invalid-XML.doc, Test.doc, test.html > > > In some WORD (.doc, .docx) documents the XHTML elements are not closed > properly. This usually happens when there are link elements () as well as > italic or bold elements (). > > Fix should be done in > [https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (TIKA-2347) Underlined text is not decorated as such when extracting from word documents
[ https://issues.apache.org/jira/browse/TIKA-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov closed TIKA-2347.
-----------------------------------

> Underlined text is not decorated as such when extracting from word documents
> ----------------------------------------------------------------------------
>
> Key: TIKA-2347
> URL: https://issues.apache.org/jira/browse/TIKA-2347
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.0, 1.14
> Reporter: Stuart Hendren
> Assignee: Dave Meikle
> Priority: Major
> Fix For: 1.17
>
>
> When extracting from doc and docx bold and italic text decoration is extracted, however underlining is not. Can be demonstrated in WordParserTest or OOXMLParserTest (change to docx) with the following test case.
> {code:title=WordParserTest.java|borderStyle=solid}
> @Test
> public void testTextDecoration() throws Exception {
>     XMLResult result = getXML("testWORD_various.doc");
>     String xml = result.xml;
>     assertTrue(xml.contains("Bold"));
>     assertTrue(xml.contains("italic"));
>     assertTrue(xml.contains("underline"));
> }
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Resolved] (TIKA-2601) Invalid XHTML output for some WORD documents
[ https://issues.apache.org/jira/browse/TIKA-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-2601. - Resolution: Duplicate I mark it as duplicate for TIKA-2555 which I'm currently looking into > Invalid XHTML output for some WORD documents > > > Key: TIKA-2601 > URL: https://issues.apache.org/jira/browse/TIKA-2601 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.17 > Environment: Linked is a sample document with its corresponding > output. >Reporter: Filip >Priority: Major > Attachments: Invalid-XML.doc, Test.doc, test.html > > > In some WORD (.doc, .docx) documents the XHTML elements are not closed > properly. This usually happens when there are link elements () as well as > italic or bold elements (). > > Fix should be done in > [https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (TIKA-2555) Text with [underline] + [another format] in word document generates overlapping html tags.
[ https://issues.apache.org/jira/browse/TIKA-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov reassigned TIKA-2555:
---------------------------------------
Assignee: Konstantin Gribov

> Text with [underline] + [another format] in word document generates overlapping html tags.
> ------------------------------------------------------------------------------------------
>
> Key: TIKA-2555
> URL: https://issues.apache.org/jira/browse/TIKA-2555
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.17
> Reporter: Serban Alexe
> Assignee: Konstantin Gribov
> Priority: Minor
> Attachments: Clipboard02.jpg
>
>
> I have a sample _.docx_ document which contains a single line of text.
> Making that text:
> * +underlined+
> ** AND at least one of the following two
> * _italic_
> * *bold*
> will cause the generated _.xhtml_ file to contain overlapping tags.
>
> _+Example+_:
> *+The quick brown fox jumps over the lazy dog.+*
> will result in
> The quick brown fox jumps over the lazy dog.
> which causes some browsers (Firefox, Chrome) to give an error and not display the content of the file...
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797354#comment-16797354 ]

Konstantin Gribov commented on TIKA-2566:
-----------------------------------------

Just to clarify which option you propose:
1. use slf4j-api in tika-core; slf4j-api with bridges in tika-parsers; and log4j:1.x or log4j-core:2.x as the implementation in tika-app etc.;
2. use log4j-api:2.x in tika-core; log4j-api:2.x with bridges (slf4j, jul & jcl to log4j2-api) in tika-parsers; and log4j-core:2.x as the implementation;
3. use log4j-api:2.x in tika-core/tika-parsers; force the user to configure logging deps correctly in order to use tika-parsers; and use log4j-core:2.x as the implementation in tika-app etc.?

Option 1 is what I suggested initially in TIKA-2245 and what is currently in master.
Option 2 is similar but seems more complex, since we would still have slf4j-api, a bridge for commons-logging/jcl, a bridge for JUL and a bridge for slf4j.
Option 3 is less preferable since it requires the downstream user to add all bridges manually, which is error-prone and could be a bit fragile.

My preference is option 1, since it's a logical improvement over the current status quo (JUL in tika-core and slf4j + jul-to-slf4j + jcl-over-slf4j in tika-parsers). Then a downstream user can use:
- log4j 1.x: add log4j:1.x and slf4j-log4j12;
- logback-classic: just add logback-classic;
- log4j 2.x: add log4j-api, log4j-core, log4j-slf4j-impl (slf4j bridge), log4j-jcl (commons-logging/jcl bridge) and log4j-jul (JUL bridge), and exclude jul-to-slf4j and jcl-over-slf4j.

> Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
> -------------------------------------------------------------------------
>
> Key: TIKA-2566
> URL: https://issues.apache.org/jira/browse/TIKA-2566
> Project: Tika
> Issue Type: Sub-task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0.0
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
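The bridges discussed in the comment above all serve the same purpose: messages written through one logging API are forwarded to a single facade, so only one backend has to be configured. As a rough illustration of the idea for JUL, using only the JDK, the sketch below installs a `java.util.logging.Handler` that forwards every record to one sink. The `ForwardingHandler` class, its `sink` list and the `installOn` method are hypothetical names made up for this example; this is not the jul-to-slf4j or log4j-jul implementation, and a real bridge would pass records to the facade instead of collecting strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical illustration of a JUL bridge; not the jul-to-slf4j implementation. */
class ForwardingHandler extends Handler {
    /** Stand-in for the single facade that all messages should end up in. */
    final List<String> sink = new ArrayList<>();

    @Override
    public void publish(LogRecord record) {
        // Respect the handler's own level and filter settings
        if (!isLoggable(record)) {
            return;
        }
        sink.add(record.getLevel() + " " + record.getLoggerName() + " - " + record.getMessage());
    }

    @Override
    public void flush() {
        // Nothing is buffered in this sketch
    }

    @Override
    public void close() {
        // No resources to release in this sketch
    }

    /** Replaces the handlers of the given logger with a single forwarding handler. */
    static ForwardingHandler installOn(Logger logger) {
        for (Handler existing : logger.getHandlers()) {
            logger.removeHandler(existing);
        }
        // Stop records from also reaching the parent (console) handlers
        logger.setUseParentHandlers(false);
        ForwardingHandler handler = new ForwardingHandler();
        logger.addHandler(handler);
        return handler;
    }
}
```

After `ForwardingHandler.installOn(Logger.getLogger("org.example.parser"))`, a call such as `logger.warning("...")` on that logger lands in `sink` rather than on the console. The logger name here is only an example.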
[jira] [Commented] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797288#comment-16797288 ]

Konstantin Gribov commented on TIKA-2566:
-----------------------------------------

Since log4j2 has a bridge to slf4j-api, I don't see any major issues with using it. I prefer slf4j mostly because of its wide adoption, but log4j2 seems to be a good alternative today.

> Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
> -------------------------------------------------------------------------
>
> Key: TIKA-2566
> URL: https://issues.apache.org/jira/browse/TIKA-2566
> Project: Tika
> Issue Type: Sub-task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0.0
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796251#comment-16796251 ]

Konstantin Gribov commented on TIKA-2566:
-----------------------------------------

[~talli...@apache.org], why do you prefer log4j (which is first of all an implementation) over a thin facade (slf4j)?

Log4j 1.2 and 2.x are fine as implementations (as in tika-batch, tika-app, tika-server and tika-eval), but as a library dependency log4j seems to me much less preferable, even compared to commons-logging/jcl (which is both facade and implementation in one package). Or did I misunderstand you, and you actually suggest using log4j2-api?

I personally prefer slf4j-api for its stability and wide adoption. The only known major issue with it is JPMS support (because of the static binding approach used in 1.7.x), but they are going to fix that in the 1.8.x branch without breaking the API.

> Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
> -------------------------------------------------------------------------
>
> Key: TIKA-2566
> URL: https://issues.apache.org/jira/browse/TIKA-2566
> Project: Tika
> Issue Type: Sub-task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0.0
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika
[ https://issues.apache.org/jira/browse/TIKA-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796262#comment-16796262 ] Konstantin Gribov commented on TIKA-2566: - JFYI: https://www.slf4j.org/faq.html#changesInVersion18 states that "There are no client facing API changes in 1.8.x". It has version 1.8.0-beta4 right now in central but I hope it would be released soon. > Move logging in tika-core to log4j via slf4j as we do in the rest of Tika > - > > Key: TIKA-2566 > URL: https://issues.apache.org/jira/browse/TIKA-2566 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2245) Standardise logging
[ https://issues.apache.org/jira/browse/TIKA-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796242#comment-16796242 ] Konstantin Gribov commented on TIKA-2245: - [~talli...@apache.org], slf4j-api is quite stable from API perspective, so it should be compatible with other 1.7.x versions. But it's better to use same slf4j-api and implementation versions as SPI compatibility is not guaranteed. Sorry for belated answer. > Standardise logging > --- > > Key: TIKA-2245 > URL: https://issues.apache.org/jira/browse/TIKA-2245 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.14, 1.15 >Reporter: Matthew Caruana Galizia >Assignee: Konstantin Gribov >Priority: Major > Labels: logging > Fix For: 1.15 > > > Tika parsers sometimes use Log4j's Logger, sometimes the JUL > (java.util.logging) Logger and sometimes SLF4j. > It would be better to standardise on a single facade, for the sake of not > having to configure multiple loggers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-2756) Switch to commons-lang 3
[ https://issues.apache.org/jira/browse/TIKA-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693459#comment-16693459 ] Konstantin Gribov edited comment on TIKA-2756 at 11/20/18 4:36 PM: --- FYI, issue seems to be present only with old commons-lang (2.4) and absent with more recent like 2.6. *UPD*: Your code to reproduce issue seems to work with 2.6, I haven't tested original issue. was (Author: grossws): FYI, issue seems to be present only with old commons-lang (2.4) and absent with more recent like 2.6 > Switch to commons-lang 3 > > > Key: TIKA-2756 > URL: https://issues.apache.org/jira/browse/TIKA-2756 > Project: Tika > Issue Type: Improvement >Reporter: Robert Munteanu >Priority: Major > > Tika 1.9.1 is using the legacy commons-lang 2.x series. This series is not > going to receive updates anymore and is completely superseded by commons-lang > 3.x . > Projects that use Tika are blocked from dropping commons-lang 2.x due to this > dependency. > The link that I found was from tika-parsers to jackcess and then to > commons-lang 2.6 > {noformat} > [INFO] +- com.healthmarketscience.jackcess:jackcess:jar:2.1.12:compile > [INFO] | \- commons-lang:commons-lang:jar:2.6:compile > {noformat} > If I understand correctly, this is the only commons-lang 2.x dependency from > the Tika runtime and it would be great to remove it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2756) Switch to commons-lang 3
[ https://issues.apache.org/jira/browse/TIKA-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693459#comment-16693459 ] Konstantin Gribov commented on TIKA-2756: - FYI, issue seems to be present only with old commons-lang (2.4) and absent with more recent like 2.6 > Switch to commons-lang 3 > > > Key: TIKA-2756 > URL: https://issues.apache.org/jira/browse/TIKA-2756 > Project: Tika > Issue Type: Improvement >Reporter: Robert Munteanu >Priority: Major > > Tika 1.9.1 is using the legacy commons-lang 2.x series. This series is not > going to receive updates anymore and is completely superseded by commons-lang > 3.x . > Projects that use Tika are blocked from dropping commons-lang 2.x due to this > dependency. > The link that I found was from tika-parsers to jackcess and then to > commons-lang 2.6 > {noformat} > [INFO] +- com.healthmarketscience.jackcess:jackcess:jar:2.1.12:compile > [INFO] | \- commons-lang:commons-lang:jar:2.6:compile > {noformat} > If I understand correctly, this is the only commons-lang 2.x dependency from > the Tika runtime and it would be great to remove it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (TIKA-2721) Exclude Spring (transitive dependency) from tika-parsers
[ https://issues.apache.org/jira/browse/TIKA-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov closed TIKA-2721.
-----------------------------------

> Exclude Spring (transitive dependency) from tika-parsers
> --------------------------------------------------------
>
> Key: TIKA-2721
> URL: https://issues.apache.org/jira/browse/TIKA-2721
> Project: Tika
> Issue Type: Bug
> Components: packaging
> Reporter: Konstantin Gribov
> Assignee: Konstantin Gribov
> Priority: Minor
> Fix For: 2.0, 1.19
>
>
> {{uimafit-core}} brings in {{spring-core}}, {{spring-beans}} and {{spring-context}} at the quite ancient version 3.2.x, which is not required for parsing and usually clashes with the actual Spring libs, or just pollutes the jar if an uberjar (shade plugin, onejar, assembly plugin with jar-with-dependencies etc.) is used.
> Excluding it from the deps seems more or less safe to me. But formally it can be seen as a breaking change if someone depends on tika-parsers providing the Spring libs transitively.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (TIKA-2552) Upgrade to POI 4.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617718#comment-16617718 ] Konstantin Gribov commented on TIKA-2552: - [~TigerC10], Tim rolled RC1 this weekend, so, hopefully this week. > Upgrade to POI 4.0.0 when available > --- > > Key: TIKA-2552 > URL: https://issues.apache.org/jira/browse/TIKA-2552 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Blocker > Fix For: 1.19, 2.0.0 > > Attachments: TIKA-2552_--_first_draft.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2716) Sonatype Nexus auditor is reporting that spring framework vesrion used by Tika 1.18 is vulnerable
[ https://issues.apache.org/jira/browse/TIKA-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603335#comment-16603335 ] Konstantin Gribov commented on TIKA-2716: - Won't Fix because {{spring-*}} is excluded from dependency tree now (see TIKA-2721) > Sonatype Nexus auditor is reporting that spring framework vesrion used by > Tika 1.18 is vulnerable > - > > Key: TIKA-2716 > URL: https://issues.apache.org/jira/browse/TIKA-2716 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 1.18 >Reporter: Abhijit Rajwade >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.0, 1.19 > > > Sonatype Nexus auditor is reporting that spring framework version used by > Apache Tika 1.18 is vulnerable. Recommendation is to upgrade to a non > vulnerable version of Spring framework - 4.3.15/later or 5.0.5/later > > Refer following details > > Issue > [CVE-2018-1270|http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-1270] > > Source National Vulnerability Database > > Severity > CVE CVSS 3.0: 9.8 > CVE CVSS 2.0: 7.5 > Sonatype CVSS 3.0: 9.8 > > Weakness > CVE CWE: [358|https://cwe.mitre.org/data/definitions/358.html] > > Description from CVE > Spring Framework, versions 5.0 prior to 5.0.5 and versions 4.3 prior to > 4.3.15 and older unsupported versions, allow applications to expose STOMP > over WebSocket endpoints with a simple, in-memory STOMP broker through the > spring-messaging module. A malicious user (or attacker) can craft a message > to the broker that can lead to a remote code execution attack. > Explanation > The Spring Framework {{spring-messaging}} module is vulnerable to Remote Code > Execution (RCE). The {{getMethods()}} method in the > {{ReflectiveMethodResolver}} class, the {{canWrite}} method in the > {{ReflectivePropertyAccessor}} class, and the {{filterSubscriptions()}} > method in the {{DefaultSubscriptionRegistry}} class do not properly restrict > SpEL expression evaluation. 
A remote attacker can exploit this vulnerability > by crafting a request to an exposed STOMP endpoint and injecting a malicious > payload into the {{selector}} header. The application would then execute the > payload via a call to {{expression.getValue()}} whenever a new message is > sent to the broker. > > Detection > The application is vulnerable by using this component. > > Recommendation > We recommend upgrading to a version of this component that is not vulnerable > to this specific issue. > Categories > Data > Root Cause > tika-app-1.18.jar *<=* ReflectivePropertyAccessor.class : [3.0.0.RELEASE , > 4.3.15.RELEASE) > tika-app-1.18.jar *<=* ReflectiveMethodResolver.class : [3.0.0.RELEASE , > 4.3.15.RELEASE) > > Advisories > Attack: [http://www.polaris-lab.com/index.php/archives/501/] > Attack: > [https://chybeta.github.io/2018/04/07/spring-messaging-Remote...|https://chybeta.github.io/2018/04/07/spring-messaging-Remote-Code-Execution-%E5%88%86%E6%9E%90-%E3%80%90CVE-2018-1270%E3%80%91/] > Project: [https://jira.spring.io/browse/SPR-16588] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (TIKA-2716) Sonatype Nexus auditor is reporting that spring framework version used by Tika 1.18 is vulnerable
[ https://issues.apache.org/jira/browse/TIKA-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov closed TIKA-2716.

Resolution: Won't Fix
Assignee: Konstantin Gribov
Fix Version/s: 1.19 2.0
[jira] [Resolved] (TIKA-2721) Exclude Spring (transitive dependency) from tika-parsers
[ https://issues.apache.org/jira/browse/TIKA-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov resolved TIKA-2721.

Resolution: Fixed

> Exclude Spring (transitive dependency) from tika-parsers
>
> Key: TIKA-2721
> URL: https://issues.apache.org/jira/browse/TIKA-2721
> Project: Tika
> Issue Type: Bug
> Components: packaging
> Reporter: Konstantin Gribov
> Assignee: Konstantin Gribov
> Priority: Minor
> Fix For: 2.0, 1.19
>
> {{uimafit-core}} brings in {{spring-core}}, {{spring-beans}} and
> {{spring-context}} at the quite ancient version 3.2.x. They are not required
> for parsing and usually clash with the actual Spring libs, or just pollute
> the jar if an uberjar (shade plugin, onejar, assembly plugin with
> jar-with-dependencies etc) is used.
> Excluding them from the deps seems more or less safe to me. But formally it
> can be seen as a breaking change if someone depends on tika-parsers providing
> the Spring libs transitively.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2721) Exclude Spring (transitive dependency) from tika-parsers
[ https://issues.apache.org/jira/browse/TIKA-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603230#comment-16603230 ]

Konstantin Gribov commented on TIKA-2721:

All unit & integration tests passed after excluding {{spring-*}} from {{uimafit-core}}.
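For readers unfamiliar with Maven exclusions, the kind of change discussed in TIKA-2721 could look roughly like the fragment below. This is only a sketch, not the actual commit: it assumes the usual coordinates {{org.apache.uima:uimafit-core}} and the {{org.springframework}} group, and the {{uimafit.version}} property is a hypothetical placeholder.

```xml
<!-- Sketch only: uimafit.version is a hypothetical property name -->
<dependency>
  <groupId>org.apache.uima</groupId>
  <artifactId>uimafit-core</artifactId>
  <version>${uimafit.version}</version>
  <exclusions>
    <!-- Spring artifacts that arrive transitively and are not needed for parsing -->
    <exclusion>
      <groupId>org.springframework</groupId>
      <artifactId>spring-core</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.springframework</groupId>
      <artifactId>spring-beans</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.springframework</groupId>
      <artifactId>spring-context</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Running {{mvn dependency:tree -Dincludes=org.springframework}} afterwards is one way to check that no Spring artifact is left in the tree.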
[jira] [Created] (TIKA-2721) Exclude Spring (transitive dependency) from tika-parsers
Konstantin Gribov created TIKA-2721:

Summary: Exclude Spring (transitive dependency) from tika-parsers
Key: TIKA-2721
URL: https://issues.apache.org/jira/browse/TIKA-2721
Project: Tika
Issue Type: Bug
Components: packaging
Reporter: Konstantin Gribov
Assignee: Konstantin Gribov
Fix For: 2.0, 1.19

{{uimafit-core}} brings in {{spring-core}}, {{spring-beans}} and {{spring-context}} at the quite ancient version 3.2.x. They are not required for parsing and usually clash with the actual Spring libs, or just pollute the jar if an uberjar (shade plugin, onejar, assembly plugin with jar-with-dependencies etc) is used.

Excluding them from the deps seems more or less safe to me. But formally it can be seen as a breaking change if someone depends on tika-parsers providing the Spring libs transitively.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602361#comment-16602361 ]

Konstantin Gribov commented on TIKA-2680:

Just my 2c: I stopped using Tika for RFC822 parsing somewhere in 2012-2013. Since then I use mime4j directly for RFC822 and delegate attachment parsing to Tika. But in my case I know beforehand what I'll parse (normal files, plain emls, emls with external metadata from a DLP system, or MSE journaled emls), so I can parse each of them with a specific parser. Of course I have to track whether I'm parsing an attachment (set/reset a flag in the field handler if {{Content-Disposition}} is found with/without it, and reset the flag in {{startBodyPart}}) and the current depth in the multipart tree.

> Email attachments to an email are not extracted
>
> Key: TIKA-2680
> URL: https://issues.apache.org/jira/browse/TIKA-2680
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.18
> Reporter: Yury Kats
> Assignee: Tim Allison
> Priority: Major
> Attachments: main_email_in_outlook.jpg, nested.eml
>
> I have a number of email messages that contain other email messages as
> attachments (with multiple levels of nesting).
> The email attachments are parts with "Content-Type: message/rfc822" but are
> not being recognized as such.
> Attached is an example email, with the multiple levels of attachments:
> * Subject: Test email within email
> ** Subject: Email within email test
> *** Subject: Stand-up today
>
> I would like to see 3 separate emails parsed out (top level, 1st level
> attached email, 2nd level attached email), but I only get 1 email and 1
> unnamed text attachment:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J nested.eml | python -m json.tool
> [
>     {
>         "Author": "Smith Van der, H (Henry) ",
>         "Content-Length": "16649",
>         "Content-Type": "message/rfc822",
>         "Creation-Date": "2018-04-25T12:46:41Z",
>         "Message-From": "Smith Van der, H (Henry) ",
>         "Message-To": [
>             "fm.SAN Management Team ",
>             "Smith Van der, H (Henry) "
>         ],
>         "Message:From-Email": "henry.van.der.sm...@bank.com",
>         "Message:From-Name": "Smith Van der, H (Henry)",
>         "Message:Raw-Header:Auto-Submitted": "auto-generated",
>         "Message:Raw-Header:Content-Transfer-Encoding": "binary",
>         "Message:Raw-Header:Keywords": "",
>         "Message:Raw-Header:MIME-Version": "1.0",
>         "Message:Raw-Header:Message-ID": "",
>         "Message:Raw-Header:Return-Path": "<>",
>         "Message:Raw-Header:Sender": "",
>         "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
>         "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "<0fab98cd190c41f199a25c73f78a2...@bsts124002.eu.banknet.com>",
>         "Message:Raw-Header:X-MS-Journal-Report": "",
>         "Multipart-Boundary": "_728aa617-16cf-4d95-8bc2-9f1868397202_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.mail.RFC822Parser"
>         ],
>         "X-TIKA:parse_time_millis": "325",
>         "creator": "Smith Van der, H (Henry) ",
>         "dc:creator": "Smith Van der, H (Henry) ",
>         "dc:title": "Test email within email",
>         "dcterms:created": "2018-04-25T12:46:41Z",
>         "meta:author": "Smith Van der, H (Henry) ",
>         "meta:creation-date": "2018-04-25T12:46:41Z",
>         "resourceName": "nested.eml",
>         "subject": "Test email within email"
>     },
>     {
>         "Content-Encoding": "US-ASCII",
>         "Content-Type": "text/plain; charset=US-ASCII",
>         "Multipart-Boundary": "_004_8075737674787666767166806676697476787366657271727266777_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.txt.TXTParser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-1",
>         "X-TIKA:parse_time_millis": "5",
>         "embeddedResourceType": "ATTACHMENT"
>     }
> ]
> {noformat}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)