[jira] [Commented] (TIKA-3308) SVG file without xml declaration tag is detected as text/plain
[ https://issues.apache.org/jira/browse/TIKA-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635997#comment-17635997 ]

Hudson commented on TIKA-3308:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #930 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/930/])
TIKA-3308 -- add detection for svg files that lack the xml header (#808) (github: [https://github.com/apache/tika/commit/0145868ab5c1f2718dc3267e50737d22effb3ce6])
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/test/resources/test-documents/testSVG_no_xml_header.svg
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* (edit) CHANGES.txt

> SVG file without xml declaration tag is detected as text/plain
> --------------------------------------------------------------
>
> Key: TIKA-3308
> URL: https://issues.apache.org/jira/browse/TIKA-3308
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.25
> Reporter: Anas Hammani
> Priority: Minor
> Fix For: 2.6.1
>
> Attachments: logo-luma.svg
>
>
> The SVG file attached to the issue is interpreted as *text/plain* by
> {code:java}
> tika.detect(filePath){code}
>
> If I add
> {code:java}
> {code}
> at the beginning of the file, then Tika detects it as "image/svg+xml".
>
> When I read the documentation, I see that the XML declaration is not necessary for a file to
> be well-formed:
> [https://www.w3.org/TR/REC-xml/#sec-prolog-dtd]
>
> It would be great if Tika could detect a file as an SVG without the prolog.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
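To make the reported behavior concrete, here is a minimal, standard-library-only sketch of content sniffing that treats the XML declaration as optional. It is not Tika's detector: per the commit listing above, the actual fix edits tika-mimetypes.xml. The class `SvgSniffDemo`, its methods, and the sample declaration string are hypothetical names and values chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not Tika's detector or its API.
class SvgSniffDemo {

    /** True if text opens with the start tag of the named element. */
    static boolean startsWithElement(String text, String name) {
        String open = "<" + name;
        if (!text.startsWith(open) || text.length() == open.length()) {
            return false;
        }
        char next = text.charAt(open.length());
        return Character.isWhitespace(next) || next == '>' || next == '/';
    }

    /** Guesses a media type from the leading bytes of a document. */
    static String sniff(byte[] head) {
        String text = new String(head, StandardCharsets.UTF_8);
        if (text.startsWith("\uFEFF")) {   // drop a UTF-8 byte order mark
            text = text.substring(1);
        }
        text = text.trim();
        if (text.startsWith("<?xml")) {    // the declaration is optional, so skip it if present
            int end = text.indexOf("?>");
            if (end < 0) {
                return "application/xml";  // declaration truncated; root element not visible
            }
            text = text.substring(end + 2).trim();
        }
        if (startsWithElement(text, "svg")) {
            return "image/svg+xml";
        }
        // Comments, DOCTYPEs and namespace prefixes before the root are not handled here.
        return text.startsWith("<") ? "application/xml" : "text/plain";
    }

    public static void main(String[] args) {
        String noProlog = "<svg xmlns=\"http://www.w3.org/2000/svg\"></svg>";
        String withProlog = "<?xml version=\"1.0\"?>\n" + noProlog;
        System.out.println(sniff(noProlog.getBytes(StandardCharsets.UTF_8)));
        System.out.println(sniff(withProlog.getBytes(StandardCharsets.UTF_8)));
        System.out.println(sniff("plain words".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The point of the sketch is only that the decision can rest on the root element rather than on the declaration, which is what the reporter asks for.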
[jira] [Resolved] (TIKA-3308) SVG file without xml declaration tag is detected as text/plain
[ https://issues.apache.org/jira/browse/TIKA-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3308. --- Fix Version/s: 2.6.1 Resolution: Fixed > SVG file without xml declaration tag is detected as text/plain > -- > > Key: TIKA-3308 > URL: https://issues.apache.org/jira/browse/TIKA-3308 > Project: Tika > Issue Type: Bug > Components: mime >Affects Versions: 1.25 >Reporter: Anas Hammani >Priority: Minor > Fix For: 2.6.1 > > Attachments: logo-luma.svg > > > The SVG file attached to the issue is interpreted as *text/plain* by > {code:java} > tika.detect(filePath){code} > > If I add > {code:java} > {code} > at the beginning of the file, then tika detects it as "image/svg+xml" > > When i read the documentation i see that xml is not necessary for a file to > be well-formed > [https://www.w3.org/TR/REC-xml/#sec-prolog-dtd] > > It will be great if tika can detect a file as a SVG without the prolog > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3933) Migrate to Bouncy Castle for JDK 1.8
[ https://issues.apache.org/jira/browse/TIKA-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635966#comment-17635966 ]

Hudson commented on TIKA-3933:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #929 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/929/])
TIKA-3933 (#807) (github: [https://github.com/apache/tika/commit/4f84edd7ebfac943661a8e043047c70b6da21be3])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-digest-commons/pom.xml
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/pom.xml
* (edit) CHANGES.txt
* (edit) tika-parent/pom.xml
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-crypto-module/pom.xml
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/pom.xml

> Migrate to Bouncy Castle for JDK 1.8
> ------------------------------------
>
> Key: TIKA-3933
> URL: https://issues.apache.org/jira/browse/TIKA-3933
> Project: Tika
> Issue Type: Task
> Reporter: Valery Yatsynovich
> Priority: Minor
> Fix For: 2.6.1
>
>
> [https://www.bouncycastle.org/latest_releases.html:|https://www.bouncycastle.org/latest_releases.html]
> {quote}Packaging Change (users of 1.70 or earlier): BC 1.71 changed the
> jdk15on jars to jdk18on so the base has now moved to Java 8. For earlier
> JVMs, or containers/applications that cannot cope with multi-release jars,
> you should now use the jdk15to18 jars.
> {quote}
> Please migrate to BC ***jdk18on.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (TIKA-3493) dcterms:created date depends on the current TimeZone in RTF documents
[ https://issues.apache.org/jira/browse/TIKA-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635958#comment-17635958 ]

Tim Allison commented on TIKA-3493:
-----------------------------------

There are some image formats that leave us with the same problem. We should do with RTF whatever we're doing there (I think leaving it without a timezone?). I had to add this [1] to allow for successful pipes "emits" to Solr, Elasticsearch and OpenSearch.

[1] https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/metadata/filter/DateNormalizingMetadataFilter.java

> dcterms:created date depends on the current TimeZone in RTF documents
> ---------------------------------------------------------------------
>
> Key: TIKA-3493
> URL: https://issues.apache.org/jira/browse/TIKA-3493
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.0.0
> Reporter: David Pilato
> Assignee: Tim Allison
> Priority: Minor
> Attachments: Test_case_to_demo_the_change_with_Tika_1_x1.patch
>
>
> I'm migrating an existing project to Tika 2.0.0.
> I'm seeing a strange behavior.
> TL;DR: the created date of the document changes depending on the timezone.
> Long story:
> I have a unit test which extracts content and metadata from an [RTF
> document|https://github.com/dadoonet/fscrawler/raw/master/test-documents/src/main/resources/documents/test.rtf].
> When using Tika 1.27, whatever the timezone defined for my JVM, I'm always
> getting the same value for "dcterms:created": "2016-07-07T13:38:00Z".
> When running the same test with Tika 2.0.0, the date changes depending on the
> timezone.
> For example:
> * Asia/Sakhalin gives dcterms:created=2016-07-06T23:38:00Z
> * Asia/Colombo gives dcterms:created=2016-07-07T05:08:00Z
> * Europe/Stockholm gives dcterms:created=2016-07-07T08:38:00Z
>
> I don't know if it's a bug or expected. Maybe the RTF format
> does not specify the timezone.
> I'm surprised that I don't see the same behavior for Office documents,
> actually.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
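The three values in the report are consistent with a single zone-less wall-clock time being interpreted in whatever zone the JVM runs in. The sketch below uses only `java.time`; the class `WallclockDemo` is hypothetical, and the wall-clock value 2016-07-07T10:38 is an assumption inferred from the reported outputs, not read from the RTF file.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical demo class; it does not use or mirror Tika's RTF parser.
class WallclockDemo {

    /** Interprets a zone-less wall-clock time in the given zone and returns the UTC instant. */
    static Instant interpret(LocalDateTime wallclock, String zoneId) {
        return wallclock.atZone(ZoneId.of(zoneId)).toInstant();
    }

    public static void main(String[] args) {
        // Assumed wall-clock value, inferred from the outputs quoted in the issue.
        LocalDateTime created = LocalDateTime.of(2016, 7, 7, 10, 38);
        for (String zone : new String[] {"Asia/Sakhalin", "Asia/Colombo", "Europe/Stockholm"}) {
            // Instant.toString() renders the instant in UTC with a trailing 'Z'.
            System.out.println(zone + " gives dcterms:created=" + interpret(created, zone));
        }
    }
}
```

With a reasonably current time zone database, this should print the same three values as the bullet list in the issue, which suggests the shift comes from zone interpretation rather than from the document itself.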
[jira] [Commented] (TIKA-3308) SVG file without xml declaration tag is detected as text/plain
[ https://issues.apache.org/jira/browse/TIKA-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635957#comment-17635957 ] ASF GitHub Bot commented on TIKA-3308: -- tballison merged PR #808: URL: https://github.com/apache/tika/pull/808 > SVG file without xml declaration tag is detected as text/plain > -- > > Key: TIKA-3308 > URL: https://issues.apache.org/jira/browse/TIKA-3308 > Project: Tika > Issue Type: Bug > Components: mime >Affects Versions: 1.25 >Reporter: Anas Hammani >Priority: Minor > Attachments: logo-luma.svg > > > The SVG file attached to the issue is interpreted as *text/plain* by > {code:java} > tika.detect(filePath){code} > > If I add > {code:java} > {code} > at the beginning of the file, then tika detects it as "image/svg+xml" > > When i read the documentation i see that xml is not necessary for a file to > be well-formed > [https://www.w3.org/TR/REC-xml/#sec-prolog-dtd] > > It will be great if tika can detect a file as a SVG without the prolog > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [tika] tballison merged pull request #808: TIKA-3308
tballison merged PR #808: URL: https://github.com/apache/tika/pull/808 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-3493) dcterms:created date depends on the current TimeZone in RTF documents
[ https://issues.apache.org/jira/browse/TIKA-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635951#comment-17635951 ]

Konstantin Gribov commented on TIKA-3493:
-----------------------------------------

Just hit the same with one of the tests failing. I looked through RTF spec 1.9 and they effectively have local date/time (just wallclock without time zone) there. Right now it's interpreted as date/time in the current JVM timezone. Both LibreOffice and Word (on Mac) interpret them the same.

Maybe we should keep it without timezone in the metadata string (in {{dcterms:created}} or another property) and only reinterpret it with a TZ in {{Metadata#getDate}}, but it would be a breaking change. Alternatively, we could keep the raw representation plus Tika's best guess at what instant it meant. That is likely to require breaking changes too.

> dcterms:created date depends on the current TimeZone in RTF documents
> ---------------------------------------------------------------------
>
> Key: TIKA-3493
> URL: https://issues.apache.org/jira/browse/TIKA-3493
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.0.0
> Reporter: David Pilato
> Assignee: Tim Allison
> Priority: Minor
> Attachments: Test_case_to_demo_the_change_with_Tika_1_x1.patch
>
>
> I'm migrating an existing project to Tika 2.0.0.
> I'm seeing a strange behavior.
> TL;DR: the created date of the document changes depending on the timezone.
> Long story:
> I have a unit test which extracts content and metadata from an [RTF
> document|https://github.com/dadoonet/fscrawler/raw/master/test-documents/src/main/resources/documents/test.rtf].
> When using Tika 1.27, whatever the timezone defined for my JVM, I'm always
> getting the same value for "dcterms:created": "2016-07-07T13:38:00Z".
> When running the same test with Tika 2.0.0, the date changes depending on the
> timezone.
> For example:
> * Asia/Sakhalin gives dcterms:created=2016-07-06T23:38:00Z
> * Asia/Colombo gives dcterms:created=2016-07-07T05:08:00Z
> * Europe/Stockholm gives dcterms:created=2016-07-07T08:38:00Z
>
> I don't know if it's a bug or expected. Maybe the RTF format
> does not specify the timezone.
> I'm surprised that I don't see the same behavior for Office documents,
> actually.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
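One way to picture the suggestion in the comment above (store the value without a zone, and attach a zone only when a caller asks for an instant) is the following sketch. `ZonelessDate` and its methods are hypothetical and are not part of Tika's `Metadata` API; the sample value is the same assumed wall-clock time of 2016-07-07 10:38.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical holder type for illustration; not an existing Tika class.
final class ZonelessDate {
    private final LocalDateTime wallclock;

    ZonelessDate(int year, int month, int day, int hour, int minute) {
        this.wallclock = LocalDateTime.of(year, month, day, hour, minute);
    }

    /** Raw representation: ISO local date-time with no offset or 'Z' suffix. */
    String raw() {
        return DateTimeFormatter.ISO_LOCAL_DATE_TIME.format(wallclock);
    }

    /** Best-guess instant, computed only when the caller supplies a zone. */
    Instant resolve(ZoneId zone) {
        return wallclock.atZone(zone).toInstant();
    }

    public static void main(String[] args) {
        ZonelessDate created = new ZonelessDate(2016, 7, 7, 10, 38);
        System.out.println("raw      = " + created.raw());
        System.out.println("as UTC   = " + created.resolve(ZoneId.of("UTC")));
        System.out.println("as local = " + created.resolve(ZoneId.systemDefault()));
    }
}
```

The stored string stays identical on every machine; only the explicitly resolved instant varies with the zone, which keeps the JVM default out of the metadata value itself.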
[jira] [Resolved] (TIKA-3933) Migrate to Bouncy Castle for JDK 1.8
[ https://issues.apache.org/jira/browse/TIKA-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3933. --- Fix Version/s: 2.6.1 Resolution: Fixed > Migrate to Bouncy Castle for JDK 1.8 > > > Key: TIKA-3933 > URL: https://issues.apache.org/jira/browse/TIKA-3933 > Project: Tika > Issue Type: Task >Reporter: Valery Yatsynovich >Priority: Minor > Fix For: 2.6.1 > > > [https://www.bouncycastle.org/latest_releases.html:|https://www.bouncycastle.org/latest_releases.html] > {quote}Packaging Change (users of 1.70 or earlier): BC 1.71 changed the > jdk15on jars to jdk18on so the base has now moved to Java 8. For earlier > JVMs, or containers/applications that cannot cope with multi-release jars, > you should now use the jdk15to18 jars. > {quote} > Please. migrate to BC ***jdk18on. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3308) SVG file without xml declaration tag is detected as text/plain
[ https://issues.apache.org/jira/browse/TIKA-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635918#comment-17635918 ] ASF GitHub Bot commented on TIKA-3308: -- tballison opened a new pull request, #808: URL: https://github.com/apache/tika/pull/808 Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes. * the issue ID (`TIKA-`) - is referenced in the title of the pull request - and placed in front of your commit messages surrounded by square brackets (`[TIKA-] Issue or pull request title`) * commits are squashed into a single one (or few commits for larger changes) * Tika is successfully built and unit tests pass by running `mvn clean test` * there should be no conflicts when merging the pull request branch into the *recent* `main` branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled `main` branch * if you add new module that downstream users will depend upon add it to relevant group in `tika-bom/pom.xml`. We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Tika in general, please sign up for the [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks! 
> SVG file without xml declaration tag is detected as text/plain > -- > > Key: TIKA-3308 > URL: https://issues.apache.org/jira/browse/TIKA-3308 > Project: Tika > Issue Type: Bug > Components: mime >Affects Versions: 1.25 >Reporter: Anas Hammani >Priority: Minor > Attachments: logo-luma.svg > > > The SVG file attached to the issue is interpreted as *text/plain* by > {code:java} > tika.detect(filePath){code} > > If I add > {code:java} > {code} > at the beginning of the file, then tika detects it as "image/svg+xml" > > When i read the documentation i see that xml is not necessary for a file to > be well-formed > [https://www.w3.org/TR/REC-xml/#sec-prolog-dtd] > > It will be great if tika can detect a file as a SVG without the prolog > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [tika] tballison opened a new pull request, #808: TIKA-3308
tballison opened a new pull request, #808: URL: https://github.com/apache/tika/pull/808 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-3933) Migrate to Bouncy Castle for JDK 1.8
[ https://issues.apache.org/jira/browse/TIKA-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635915#comment-17635915 ] ASF GitHub Bot commented on TIKA-3933: -- tballison merged PR #807: URL: https://github.com/apache/tika/pull/807 > Migrate to Bouncy Castle for JDK 1.8 > > > Key: TIKA-3933 > URL: https://issues.apache.org/jira/browse/TIKA-3933 > Project: Tika > Issue Type: Task >Reporter: Valery Yatsynovich >Priority: Minor > > [https://www.bouncycastle.org/latest_releases.html:|https://www.bouncycastle.org/latest_releases.html] > {quote}Packaging Change (users of 1.70 or earlier): BC 1.71 changed the > jdk15on jars to jdk18on so the base has now moved to Java 8. For earlier > JVMs, or containers/applications that cannot cope with multi-release jars, > you should now use the jdk15to18 jars. > {quote} > Please. migrate to BC ***jdk18on. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [tika] tballison merged pull request #807: TIKA-3933
tballison merged PR #807: URL: https://github.com/apache/tika/pull/807 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-3933) Migrate to Bouncy Castle for JDK 1.8
[ https://issues.apache.org/jira/browse/TIKA-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635887#comment-17635887 ] ASF GitHub Bot commented on TIKA-3933: -- tballison opened a new pull request, #807: URL: https://github.com/apache/tika/pull/807 Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes. * the issue ID (`TIKA-`) - is referenced in the title of the pull request - and placed in front of your commit messages surrounded by square brackets (`[TIKA-] Issue or pull request title`) * commits are squashed into a single one (or few commits for larger changes) * Tika is successfully built and unit tests pass by running `mvn clean test` * there should be no conflicts when merging the pull request branch into the *recent* `main` branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled `main` branch * if you add new module that downstream users will depend upon add it to relevant group in `tika-bom/pom.xml`. We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Tika in general, please sign up for the [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks! 
> Migrate to Bouncy Castle for JDK 1.8 > > > Key: TIKA-3933 > URL: https://issues.apache.org/jira/browse/TIKA-3933 > Project: Tika > Issue Type: Task >Reporter: Valery Yatsynovich >Priority: Minor > > [https://www.bouncycastle.org/latest_releases.html:|https://www.bouncycastle.org/latest_releases.html] > {quote}Packaging Change (users of 1.70 or earlier): BC 1.71 changed the > jdk15on jars to jdk18on so the base has now moved to Java 8. For earlier > JVMs, or containers/applications that cannot cope with multi-release jars, > you should now use the jdk15to18 jars. > {quote} > Please. migrate to BC ***jdk18on. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [tika] tballison opened a new pull request, #807: TIKA-3933
tballison opened a new pull request, #807: URL: https://github.com/apache/tika/pull/807 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (TIKA-3933) Migrate to Bouncy Castle for JDK 1.8
[ https://issues.apache.org/jira/browse/TIKA-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valery Yatsynovich updated TIKA-3933: - Description: [https://www.bouncycastle.org/latest_releases.html:|https://www.bouncycastle.org/latest_releases.html] {quote}Packaging Change (users of 1.70 or earlier): BC 1.71 changed the jdk15on jars to jdk18on so the base has now moved to Java 8. For earlier JVMs, or containers/applications that cannot cope with multi-release jars, you should now use the jdk15to18 jars. {quote} Please. migrate to BC ***jdk18on. was: https://www.bouncycastle.org/latest_releases.html: bq. Packaging Change (users of 1.70 or earlier): BC 1.71 changed the jdk15on jars to jdk18on so the base has now moved to Java 8. For earlier JVMs, or containers/applications that cannot cope with multi-release jars, you should now use the jdk15to18 jars. Please. migrate to BC ***jdk18on. > Migrate to Bouncy Castle for JDK 1.8 > > > Key: TIKA-3933 > URL: https://issues.apache.org/jira/browse/TIKA-3933 > Project: Tika > Issue Type: Task >Reporter: Valery Yatsynovich >Priority: Minor > > [https://www.bouncycastle.org/latest_releases.html:|https://www.bouncycastle.org/latest_releases.html] > {quote}Packaging Change (users of 1.70 or earlier): BC 1.71 changed the > jdk15on jars to jdk18on so the base has now moved to Java 8. For earlier > JVMs, or containers/applications that cannot cope with multi-release jars, > you should now use the jdk15to18 jars. > {quote} > Please. migrate to BC ***jdk18on. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3933) Migrate to Bouncy Castle for JDK 1.8
Valery Yatsynovich created TIKA-3933:
-------------------------------------

Summary: Migrate to Bouncy Castle for JDK 1.8
Key: TIKA-3933
URL: https://issues.apache.org/jira/browse/TIKA-3933
Project: Tika
Issue Type: Task
Reporter: Valery Yatsynovich

https://www.bouncycastle.org/latest_releases.html:
bq. Packaging Change (users of 1.70 or earlier): BC 1.71 changed the jdk15on jars to jdk18on so the base has now moved to Java 8. For earlier JVMs, or containers/applications that cannot cope with multi-release jars, you should now use the jdk15to18 jars.

Please migrate to BC ***jdk18on.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)