[jira] [Commented] (TIKA-3510) tika-parser-scientific-module seems to embed many dependencies
[ https://issues.apache.org/jira/browse/TIKA-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397680#comment-17397680 ]

Hudson commented on TIKA-3510:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #308 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/308/])
TIKA-3510 -- further fixes (tallison: [https://github.com/apache/tika/commit/a2b21f85817333c4e8396713069e6b389899af82])
* (edit) tika-parsers/tika-parsers-extended/tika-parsers-extended-integration-tests/pom.xml
* (edit) tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package/pom.xml
* (edit) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/pom.xml
* (edit) tika-parsers/tika-parsers-extended/tika-parser-sqlite3-module/pom.xml
* (edit) tika-parsers/tika-parsers-extended/tika-parser-scientific-module/pom.xml

> tika-parser-scientific-module seems to embed many dependencies
> --------------------------------------------------------------
>
> Key: TIKA-3510
> URL: https://issues.apache.org/jira/browse/TIKA-3510
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.0.0
> Reporter: Thomas Mortagne
> Priority: Major
>
> tika-parser-scientific-module 2.0.0 contains many files from other artifacts:
> * joda-time
> * slf4j
> * commons-io
> * ...
> Is that really expected?
> tika-parser-sqlite3-module seems to be affected too

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3510) tika-parser-scientific-module seems to embed many dependencies
[ https://issues.apache.org/jira/browse/TIKA-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397641#comment-17397641 ]

Tim Allison commented on TIKA-3510:
-----------------------------------

[~tmortagne], please take a look and see if this will meet your needs. If you can recommend a more elegant solution, I'd appreciate it. I'm not thrilled with it as is.

Thank you [~kkrugler] for your feedback! I went with 1/2 of it for now.

Onwards!

> tika-parser-scientific-module seems to embed many dependencies
> --------------------------------------------------------------
>
> Key: TIKA-3510
> URL: https://issues.apache.org/jira/browse/TIKA-3510
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.0.0
> Reporter: Thomas Mortagne
> Priority: Major
>
> tika-parser-scientific-module 2.0.0 contains many files from other artifacts:
> * joda-time
> * slf4j
> * commons-io
> * ...
> Is that really expected?
> tika-parser-sqlite3-module seems to be affected too

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3502) General upgrades for 2.0.1
[ https://issues.apache.org/jira/browse/TIKA-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397535#comment-17397535 ] Hudson commented on TIKA-3502: -- FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #306 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/306/]) TIKA-3502 -- general upgrades for the next 2.x version. (tallison: [https://github.com/apache/tika/commit/9e02bbfe3234bc8d31e4f20fe195ada54163514b]) * (edit) tika-parent/pom.xml > General upgrades for 2.0.1 > -- > > Key: TIKA-3502 > URL: https://issues.apache.org/jira/browse/TIKA-3502 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3522) Reduce calls to TikaConfig.getDefaultConfig
[ https://issues.apache.org/jira/browse/TIKA-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397534#comment-17397534 ] Hudson commented on TIKA-3522: -- FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #306 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/306/]) TIKA-3522 -- reduce use of TikaConfig.getDefaultConfig (tallison: [https://github.com/apache/tika/commit/0139e71e3fb27ee70d73c92764d6b7a3fdb56462]) * (edit) tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparer.java * (edit) tika-eval/tika-eval-core/src/test/java/org/apache/tika/eval/core/util/MimeUtilTest.java * (edit) tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-jdbc-commons/src/main/java/org/apache/tika/parser/jdbc/JDBCTableReader.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/TestParsers.java * (edit) tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/AbstractProfiler.java * (edit) tika-parsers/tika-parsers-extended/tika-parser-scientific-module/src/main/java/org/apache/tika/parser/isatab/ISATabUtils.java * (edit) tika-parsers/tika-parsers-extended/tika-parser-scientific-module/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java > Reduce calls to TikaConfig.getDefaultConfig > --- > > Key: TIKA-3522 > URL: https://issues.apache.org/jira/browse/TIKA-3522 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > > This is expensive, and we can reduce calls to it in some of our unit tests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
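The ticket above is about reusing one expensive default object instead of rebuilding it on every call. A minimal stdlib sketch of that pattern, using the lazy-holder idiom (all class and method names here are invented for illustration; this is not Tika's actual `TikaConfig` code):

```java
// Illustrative only: cache an expensive "default config" so it is built
// exactly once, on first use, thread-safely. The same idea applies to
// reusing a single TikaConfig rather than calling getDefaultConfig()
// repeatedly in tests.
final class DefaultConfigCache {
    static int buildCount = 0; // exposed so the demo can show one-time init

    // Stands in for the expensive configuration object.
    static final class Config {
        final String name;
        Config(String name) { this.name = name; }
    }

    // Holder idiom: the JVM initializes Holder (and thus INSTANCE) lazily
    // and at most once, with no explicit locking needed.
    private static final class Holder {
        static final Config INSTANCE = build();
    }

    private static Config build() {
        buildCount++; // stands in for the costly config parse
        return new Config("default");
    }

    static Config get() { return Holder.INSTANCE; }
}
```

Callers then share `DefaultConfigCache.get()` instead of constructing a fresh config per call site.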
[jira] [Commented] (TIKA-3521) Move checkActive out of fetchemitworkers within AsyncProcessor
[ https://issues.apache.org/jira/browse/TIKA-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397533#comment-17397533 ] Hudson commented on TIKA-3521: -- FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #306 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/306/]) TIKA-3521 -- move check active outside of the parse threads (tallison: [https://github.com/apache/tika/commit/76458ffddd984b699bad59a838fdc239546bdb69]) * (edit) tika-core/src/main/java/org/apache/tika/pipes/async/AsyncProcessor.java > Move checkActive out of fetchemitworkers within AsyncProcessor > -- > > Key: TIKA-3521 > URL: https://issues.apache.org/jira/browse/TIKA-3521 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Trivial > Fix For: 2.0.1 > > > The heartbeat check in AsyncProcessor is carried out by the parse threads. > However, this check should continue through the life of the object. The > parse threads may complete, but the emitter threads may still be active. > Let's move this to a separate thread. -- This message was sent by Atlassian Jira (v8.3.4#803005)
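The fix described above, moving a heartbeat check off the worker threads and onto its own thread so it runs for the life of the object, can be sketched with a stdlib scheduler (hypothetical names; not the actual AsyncProcessor code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: a periodic "checkActive" heartbeat on a dedicated
// single-thread scheduler. It keeps firing even after parse/worker threads
// complete, and stops only when the owner is closed.
final class Heartbeat implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    final AtomicInteger checks = new AtomicInteger();

    void start(long periodMillis) {
        scheduler.scheduleAtFixedRate(
                checks::incrementAndGet, // stands in for checkActive()
                0, periodMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() throws InterruptedException {
        scheduler.shutdown();
        scheduler.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```

The key design point is that the scheduler's lifetime is tied to the owning object, not to any worker pool.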
[jira] [Commented] (TIKA-3515) Tika CLI -t should use UTF-8 as default output encoding
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397523#comment-17397523 ]

Luís Filipe Nassif commented on TIKA-3515:
------------------------------------------

"I think we should also deprecate the initialization of WriteOutContentHandler and ToTextContentHandler with only an outputstream because these call Charset.getDefaultCharset()."

Agreed, thank you [~tallison]!

> Tika CLI -t should use UTF-8 as default output encoding
> -------------------------------------------------------
>
> Key: TIKA-3515
> URL: https://issues.apache.org/jira/browse/TIKA-3515
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 2.0.0, 1.27
> Environment: Windows 10, Liberica OpenJDK FULL x64 1.8.0_302
> Reporter: Luís Filipe Nassif
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 2.0.1
>
> Attachments: Korean lessons_ Lesson 2 – Learnkorean.com.pdf,
> LIVE-Seoul-ntfs-utf-16-be.txt, LIVE-Seoul-ntfs-utf-16-le.txt,
> LIVE-Seoul-ntfs-utf-8.txt, LIVE-Seoul-ntfs-utf-8.txt_-x_output.xml,
> LIVE-Seoul-ntfs-utf-8_-t_output.txt, Screen Shot 2021-08-06 at 5.50.04
> PM.png, Screen Shot 2021-08-06 at 5.50.21 PM.png, Screen Shot tika-app.png,
> image-2021-08-09-14-37-30-552.png, image-2021-08-09-14-38-26-763.png
>
> Some Korean chars are extracted as squares. The encodings of plain texts are
> detected correctly. Maybe this is related to the content handler (just a
> guess). I'll attach the triggering files.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
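The point in the quoted comment is that wrapping an OutputStream without naming a charset silently picks up `Charset.defaultCharset()` (e.g. a Windows codepage), which is how Korean characters end up as squares. A stdlib-only sketch of the safe pattern behind the fix:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: always name the charset when wrapping a stream.
// An explicit UTF-8 writer keeps non-ASCII text intact regardless of the
// platform default encoding, which is the idea behind making -t default
// to UTF-8.
final class Utf8Out {
    static byte[] write(String text) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            w.write(text); // never relies on Charset.defaultCharset()
        }
        return buf.toByteArray();
    }
}
```

The same principle motivates deprecating the stream-only constructors: forcing callers to pass a charset (or a Writer) makes the encoding decision explicit.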
[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822
[ https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397511#comment-17397511 ] Hudson commented on TIKA-3489: -- UNSTABLE: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #144 (See [https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/144/]) TIKA-3489 -- detect robots.txt files as text/x-robots (tallison: [https://github.com/apache/tika/commit/27e7eac5fc7c2122076237c191a2bd0aa2748aa4]) * (add) tika-parsers/src/test/resources/test-documents/testRobots.txt * (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java * (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml > Robots.txt files frequently identified as message/rfc822 > > > Key: TIKA-3489 > URL: https://issues.apache.org/jira/browse/TIKA-3489 > Project: Tika > Issue Type: Bug > Components: mime >Affects Versions: 2.0.0, 1.25, 1.26, 1.27 >Reporter: Sebastian Nagel >Assignee: Tim Allison >Priority: Minor > Fix For: 2.0.1, 1.27.1 > > Attachments: robots.txt > > > The Tika MIME detector recognizes a robots.txt file as "message/rfc822" if > the file starts with a "User-Agent" rule and contains also a second rule not > too far away from the beginning, e.g.: > {noformat} > User-Agent: goodbot > Disallow: > User-Agent: badbot > Disallow: / > {noformat} > The change > [7769a2b|https://github.com/apache/tika/commit/7769a2b4fba2b4af7127eba0c7694f663fd97a13] > requires that two different clauses are matched. However, the two > occurrences of "User-Agent:" (initial and after a new line) are treated as > different instead of equivalent matches. -- This message was sent by Atlassian Jira (v8.3.4#803005)
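The core of the bug above is that two hits on the *same* header ("User-Agent:" at the start and again after a newline) were counted as two different clauses. A hypothetical stdlib sketch of the corrected logic (this is not Tika's mime-magic engine, and the header list is invented for illustration): deduplicate matches by header name before counting.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative heuristic: call something message/rfc822 only if at least
// two *distinct* mail-like headers appear; repeats of the same header
// (e.g. User-Agent in robots.txt) count once.
final class Rfc822Heuristic {
    private static final Pattern HEADER =
            Pattern.compile("(?m)^([A-Za-z][A-Za-z-]*):");
    // Hypothetical subset of rfc822 header names for the demo.
    private static final Set<String> MAIL_HEADERS = Set.of(
            "received", "from", "to", "subject", "date",
            "message-id", "mime-version", "user-agent");

    static boolean looksLikeRfc822(String head) {
        Set<String> distinct = new HashSet<>();
        Matcher m = HEADER.matcher(head);
        while (m.find()) {
            String name = m.group(1).toLowerCase();
            if (MAIL_HEADERS.contains(name)) {
                distinct.add(name); // equivalent matches collapse to one
            }
        }
        return distinct.size() >= 2;
    }
}
```

Under this rule a robots.txt with repeated "User-Agent:"/"Disallow:" blocks contributes only one distinct mail header and no longer trips the mail check.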
[jira] [Resolved] (TIKA-3521) Move checkActive out of fetchemitworkers within AsyncProcessor
[ https://issues.apache.org/jira/browse/TIKA-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3521. --- Fix Version/s: 2.0.1 Assignee: Tim Allison Resolution: Fixed > Move checkActive out of fetchemitworkers within AsyncProcessor > -- > > Key: TIKA-3521 > URL: https://issues.apache.org/jira/browse/TIKA-3521 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Trivial > Fix For: 2.0.1 > > > The heartbeat check in AsyncProcessor is carried out by the parse threads. > However, this check should continue through the life of the object. The > parse threads may complete, but the emitter threads may still be active. > Let's move this to a separate thread. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (TIKA-3522) Reduce calls to TikaConfig.getDefaultConfig
Tim Allison created TIKA-3522: - Summary: Reduce calls to TikaConfig.getDefaultConfig Key: TIKA-3522 URL: https://issues.apache.org/jira/browse/TIKA-3522 Project: Tika Issue Type: Task Reporter: Tim Allison This is expensive, and we can reduce calls to it in some of our unit tests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3515) Tika CLI -t should use UTF-8 as default output encoding
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397490#comment-17397490 ]

Hudson commented on TIKA-3515:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #305 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/305/])
TIKA-3515 -- Tika CLI -t should use UTF-8 as default output encoding (tallison: [https://github.com/apache/tika/commit/c792036e618f71fca851fd2ec90e8d23aaffd3d5])
* (edit) tika-core/src/main/java/org/apache/tika/sax/ToTextContentHandler.java
* (edit) tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* (edit) tika-parsers/tika-parsers-ml/tika-dl/src/main/java/org/apache/tika/dl/imagerec/DL4JInceptionV3Net.java
* (edit) tika-core/src/test/java/org/apache/tika/sax/RichTextContentHandlerTest.java
* (edit) tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/test/java/org/apache/tika/parser/ner/NamedEntityParserTest.java
* (edit) CHANGES.txt
* (edit) tika-core/src/main/java/org/apache/tika/sax/WriteOutContentHandler.java
* (edit) tika-parsers/tika-parsers-ml/tika-age-recogniser/src/test/java/org/apache/tika/parser/recognition/AgeRecogniserTest.java
* (edit) tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/test/java/org/apache/tika/parser/sentiment/SentimentAnalysisParserTest.java

> Tika CLI -t should use UTF-8 as default output encoding
> -------------------------------------------------------
>
> Key: TIKA-3515
> URL: https://issues.apache.org/jira/browse/TIKA-3515
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 2.0.0, 1.27
> Environment: Windows 10, Liberica OpenJDK FULL x64 1.8.0_302
> Reporter: Luís Filipe Nassif
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 2.0.1
>
> Attachments: Korean lessons_ Lesson 2 – Learnkorean.com.pdf,
> LIVE-Seoul-ntfs-utf-16-be.txt, LIVE-Seoul-ntfs-utf-16-le.txt,
> LIVE-Seoul-ntfs-utf-8.txt, LIVE-Seoul-ntfs-utf-8.txt_-x_output.xml,
> LIVE-Seoul-ntfs-utf-8_-t_output.txt, Screen Shot 2021-08-06 at 5.50.04
> PM.png, Screen Shot 2021-08-06 at 5.50.21 PM.png, Screen Shot tika-app.png,
> image-2021-08-09-14-37-30-552.png, image-2021-08-09-14-38-26-763.png
>
> Some Korean chars are extracted as squares. The encodings of plain texts are
> detected correctly. Maybe this is related to the content handler (just a
> guess). I'll attach the triggering files.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3510) tika-parser-scientific-module seems to embed many dependencies
[ https://issues.apache.org/jira/browse/TIKA-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397488#comment-17397488 ]

Hudson commented on TIKA-3510:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #305 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/305/])
TIKA-3510 -- separate out modules/packages for tika-parsers-extended (tallison: [https://github.com/apache/tika/commit/509748b336a2ddc368a237d026fefeee57300325])
* (edit) tika-parsers/tika-parsers-extended/tika-parser-scientific-module/pom.xml
* (edit) tika-parsers/tika-parsers-extended/pom.xml
* (add) tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package/pom.xml
* (edit) CHANGES.txt
* (edit) tika-parsers/tika-parsers-extended/tika-parser-sqlite3-module/pom.xml
* (edit) pom.xml
* (add) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/pom.xml

> tika-parser-scientific-module seems to embed many dependencies
> --------------------------------------------------------------
>
> Key: TIKA-3510
> URL: https://issues.apache.org/jira/browse/TIKA-3510
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.0.0
> Reporter: Thomas Mortagne
> Priority: Major
>
> tika-parser-scientific-module 2.0.0 contains many files from other artifacts:
> * joda-time
> * slf4j
> * commons-io
> * ...
> Is that really expected?
> tika-parser-sqlite3-module seems to be affected too

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3520) Revert rendering only non-text elements in auto mode for PDFs
[ https://issues.apache.org/jira/browse/TIKA-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397489#comment-17397489 ]

Hudson commented on TIKA-3520:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #305 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/305/])
TIKA-3520 -- change default rendering option to ALL (tallison: [https://github.com/apache/tika/commit/0bf273a0b3635f9399c027dce7c031088abfb0e9])
* (edit) CHANGES.txt
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java

> Revert rendering only non-text elements in auto mode for PDFs
> -------------------------------------------------------------
>
> Key: TIKA-3520
> URL: https://issues.apache.org/jira/browse/TIKA-3520
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Major
> Fix For: 2.0.1
>
> In Tika 2.0.0, we changed the default behavior for the AUTO mode to render
> only non-text elements. I now think we should revert this and render the full
> page, including text elements, until we can come up with a better decision
> process for automatically determining whether it would be better to render
> the full page or only non-text elements.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
Re: versions?
On Wed, 11 Aug 2021, Tim Allison wrote:

> A) I think we should maintain the 1.x branch and continue to put out bug
> fixes for a bit. Any objections to nominally calling the next release
> 1.27.1 on JIRA at least?

I agree we should probably try to keep 1.x going for at least a few months, to allow people a chance to upgrade and make the associated updates for the breaking changes. If nothing else, there are bound to be dependencies that need updates for security issues!

I'm +0 on not backporting new features or even new mime types, just bug fixes and security fixes. I don't mind if we call the next release 1.28 or 1.27.1.

> B) We've made quite a few changes in the main branch since the release of
> 2.0.0. Would there be any objections to incrementing the MINOR version for
> the next release: 2.1.0?

I think 2.1.0 is probably worth using, given that most users will need to read the release notes, and some (but not all) users will need to make changes for the changed defaults etc.

Nick
versions?
All, Two questions: A) I think we should maintain the 1.x branch and continue to put out bug fixes for a bit. Any objections to nominally calling the next release 1.27.1 on JIRA at least? B) We've made quite a few changes in the main branch since the release of 2.0.0. Would there be any objections to incrementing the MINOR version for the next release: 2.1.0? Thank you. Best, Tim P.S. Apologies for the delay in the release of the next 2.x. I've been busy with other items. I should have time to start the release process tomorrow or so.
[jira] [Resolved] (TIKA-3489) Robots.txt files frequently identified as message/rfc822
[ https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3489. --- Fix Version/s: 2.0.1 1.27.1 Assignee: Tim Allison Resolution: Fixed Thank you, all! > Robots.txt files frequently identified as message/rfc822 > > > Key: TIKA-3489 > URL: https://issues.apache.org/jira/browse/TIKA-3489 > Project: Tika > Issue Type: Bug > Components: mime >Affects Versions: 2.0.0, 1.25, 1.26, 1.27 >Reporter: Sebastian Nagel >Assignee: Tim Allison >Priority: Minor > Fix For: 1.27.1, 2.0.1 > > Attachments: robots.txt > > > The Tika MIME detector recognizes a robots.txt file as "message/rfc822" if > the file starts with a "User-Agent" rule and contains also a second rule not > too far away from the beginning, e.g.: > {noformat} > User-Agent: goodbot > Disallow: > User-Agent: badbot > Disallow: / > {noformat} > The change > [7769a2b|https://github.com/apache/tika/commit/7769a2b4fba2b4af7127eba0c7694f663fd97a13] > requires that two different clauses are matched. However, the two > occurrences of "User-Agent:" (initial and after a new line) are treated as > different instead of equivalent matches. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (TIKA-3520) Revert rendering only non-text elements in auto mode for PDFs
[ https://issues.apache.org/jira/browse/TIKA-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-3520.
-------------------------------
Fix Version/s: 2.0.1
Assignee: Tim Allison
Resolution: Fixed

> Revert rendering only non-text elements in auto mode for PDFs
> -------------------------------------------------------------
>
> Key: TIKA-3520
> URL: https://issues.apache.org/jira/browse/TIKA-3520
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Major
> Fix For: 2.0.1
>
> In Tika 2.0.0, we changed the default behavior for the AUTO mode to render
> only non-text elements. I now think we should revert this and render the full
> page, including text elements, until we can come up with a better decision
> process for automatically determining whether it would be better to render
> the full page or only non-text elements.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (TIKA-3515) Tika CLI -t should use UTF-8 as default output encoding
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3515. --- Fix Version/s: 2.0.1 Assignee: Tim Allison Resolution: Fixed > Tika CLI -t should use UTF-8 as default output encoding > --- > > Key: TIKA-3515 > URL: https://issues.apache.org/jira/browse/TIKA-3515 > Project: Tika > Issue Type: Improvement >Affects Versions: 2.0.0, 1.27 > Environment: Windows 10, Liberica OpenJDK FULL x64 1.8.0_302 >Reporter: Luís Filipe Nassif >Assignee: Tim Allison >Priority: Minor > Fix For: 2.0.1 > > Attachments: Korean lessons_ Lesson 2 – Learnkorean.com.pdf, > LIVE-Seoul-ntfs-utf-16-be.txt, LIVE-Seoul-ntfs-utf-16-le.txt, > LIVE-Seoul-ntfs-utf-8.txt, LIVE-Seoul-ntfs-utf-8.txt_-x_output.xml, > LIVE-Seoul-ntfs-utf-8_-t_output.txt, Screen Shot 2021-08-06 at 5.50.04 > PM.png, Screen Shot 2021-08-06 at 5.50.21 PM.png, Screen Shot tika-app.png, > image-2021-08-09-14-37-30-552.png, image-2021-08-09-14-38-26-763.png > > > Some Korean chars are extracted as squares. The encodings of plain texts are > detected correctly. Maybe this is related with the content handler (just a > guess). I'll attach the triggering files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (TIKA-3521) Move checkActive out of fetchemitworkers within AsyncProcessor
Tim Allison created TIKA-3521: - Summary: Move checkActive out of fetchemitworkers within AsyncProcessor Key: TIKA-3521 URL: https://issues.apache.org/jira/browse/TIKA-3521 Project: Tika Issue Type: Task Reporter: Tim Allison The heartbeat check in AsyncProcessor is carried out by the parse threads. However, this check should continue through the life of the object. The parse threads may complete, but the emitter threads may still be active. Let's move this to a separate thread. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (TIKA-3483) Implement a network policy for Helm Chart
[ https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved TIKA-3483. Resolution: Fixed > Implement a network policy for Helm Chart > - > > Key: TIKA-3483 > URL: https://issues.apache.org/jira/browse/TIKA-3483 > Project: Tika > Issue Type: Improvement > Components: helm >Reporter: Lewis John McGibbney >Priority: Major > Fix For: 2.0.1 > > > See https://github.com/apache/tika-helm/pull/5 for context -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached
[ https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397440#comment-17397440 ]

Tim Allison commented on TIKA-3519:
-----------------------------------

Can you share an example file with me?

> Wonder if you can add a feature for Tika parser to stop reading metadata and
> body content if certain amount of memory or body content has reached
> ----------------------------------------------------------------------------
>
> Key: TIKA-3519
> URL: https://issues.apache.org/jira/browse/TIKA-3519
> Project: Tika
> Issue Type: Wish
> Components: detector
> Affects Versions: 1.25, 1.26
> Environment: Linux
> Reporter: Xiaohong Yang
> Priority: Major
>
> We use org.apache.tika.parser.AutoDetectParser to get the metadata and body
> content of MS Office files. We encountered the following exception with some
> files:
>
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an
> array of length 14523048, but 500 is the maximum for this record type. If
> the file is not corrupt, please open an issue on bugzilla to request
> increasing the maximum allowable size for this record type. As a temporary
> workaround, consider setting a higher override value with
> IOUtils.setByteArrayMaxOverride()
>
> To resolve the problem we set byteArrayMaxOverride in the tika-config.xml
> file as follows:
>
> type="int">2000
>
> This helped to parse some files that failed previously, but some other files
> still failed, and then we increased the value to 200 MB and 500 MB.
>
> Some other files may still fail with byteArrayMaxOverride set to 500 MB. So
> we wonder if you can add a feature to the Tika parser for it to stop reading
> metadata and body content once a certain amount of memory or body content has
> been reached. The parser would return the metadata and body content obtained
> so far, and a warning message would be returned to the caller if this
> happens. This would help us get the metadata and body content from files that
> require a lot of memory. We may not be able to successfully parse some files
> without this feature, because those files fail somewhere else with an
> out-of-memory error after we set byteArrayMaxOverride to very high values and
> the above-mentioned failure does not happen. With this feature we would get
> truncated body content for some files, but that is better than getting
> nothing. Actually, we will truncate the body content ourselves if it is too
> large, so we do not care if the body content is truncated once it reaches a
> certain amount.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
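Tika already offers something close to the requested behavior: `org.apache.tika.sax.BodyContentHandler(int writeLimit)` stops recording content once the character limit is hit, keeping what was parsed so far. A stdlib-only sketch of the same "truncate and flag" idea, in isolation from Tika (class and method names invented for illustration):

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Illustrative sketch: a Writer that keeps at most `limit` characters and
// records that the limit was reached, so the caller can return the partial
// body content along with a warning.
final class TruncatingWriter extends Writer {
    private final StringWriter out = new StringWriter();
    private final int limit;
    private boolean limitReached = false;

    TruncatingWriter(int limit) { this.limit = limit; }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        int room = limit - out.getBuffer().length();
        if (room <= 0) { limitReached = true; return; }
        int n = Math.min(room, len);
        out.write(cbuf, off, n);
        if (n < len) limitReached = true; // truncated: caller can warn
    }

    @Override public void flush() {}
    @Override public void close() {}

    boolean isLimitReached() { return limitReached; }
    String content() { return out.toString(); }
}
```

Note this caps the amount of extracted text, not the parser's internal memory use, so it would not by itself prevent the POI allocation failure quoted above.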
[jira] [Updated] (TIKA-3483) Implement a network policy for Helm Chart
[ https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated TIKA-3483: --- Fix Version/s: (was: 2.0.0-BETA) 2.0.1 > Implement a network policy for Helm Chart > - > > Key: TIKA-3483 > URL: https://issues.apache.org/jira/browse/TIKA-3483 > Project: Tika > Issue Type: Improvement > Components: helm >Reporter: Lewis John McGibbney >Priority: Major > Fix For: 2.0.1 > > > See https://github.com/apache/tika-helm/pull/5 for context -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3483) Implement a network policy for Helm Chart
[ https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397439#comment-17397439 ] ASF GitHub Bot commented on TIKA-3483: -- lewismc merged pull request #5: URL: https://github.com/apache/tika-helm/pull/5 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement a network policy for Helm Chart > - > > Key: TIKA-3483 > URL: https://issues.apache.org/jira/browse/TIKA-3483 > Project: Tika > Issue Type: Improvement > Components: helm >Reporter: Lewis John McGibbney >Priority: Major > Fix For: 2.0.0-BETA > > > See https://github.com/apache/tika-helm/pull/5 for context -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [tika-helm] lewismc merged pull request #5: [TIKA-3483] Implement a network policy for Helm Chart
lewismc merged pull request #5: URL: https://github.com/apache/tika-helm/pull/5 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-3483) Implement a network policy for Helm Chart
[ https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397438#comment-17397438 ] ASF GitHub Bot commented on TIKA-3483: -- lewismc commented on pull request #5: URL: https://github.com/apache/tika-helm/pull/5#issuecomment-896946612 @bynare apologies I just ended up doing other things... I wasn't ignoring this. Thanks for your patience. LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Implement a network policy for Helm Chart > - > > Key: TIKA-3483 > URL: https://issues.apache.org/jira/browse/TIKA-3483 > Project: Tika > Issue Type: Improvement > Components: helm >Reporter: Lewis John McGibbney >Priority: Major > Fix For: 2.0.0-BETA > > > See https://github.com/apache/tika-helm/pull/5 for context -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [tika-helm] lewismc commented on pull request #5: [TIKA-3483] Implement a network policy for Helm Chart
lewismc commented on pull request #5: URL: https://github.com/apache/tika-helm/pull/5#issuecomment-896946612 @bynare apologies I just ended up doing other things... I wasn't ignoring this. Thanks for your patience. LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached
[ https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397427#comment-17397427 ]

Xiaohong Yang commented on TIKA-3519:
-------------------------------------

Can you check if you can catch the above-mentioned ByteArrayMaxOverride error (Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 14523048, but 500 is the maximum for this record type…), stop parsing, and then write the available body content to the ContentHandler so that we can have the body content parsed so far?

> Wonder if you can add a feature for Tika parser to stop reading metadata and
> body content if certain amount of memory or body content has reached
> ----------------------------------------------------------------------------
>
> Key: TIKA-3519
> URL: https://issues.apache.org/jira/browse/TIKA-3519
> Project: Tika
> Issue Type: Wish
> Components: detector
> Affects Versions: 1.25, 1.26
> Environment: Linux
> Reporter: Xiaohong Yang
> Priority: Major
>
> We use org.apache.tika.parser.AutoDetectParser to get the metadata and body
> content of MS Office files. We encountered the following exception with some
> files:
>
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an
> array of length 14523048, but 500 is the maximum for this record type. If
> the file is not corrupt, please open an issue on bugzilla to request
> increasing the maximum allowable size for this record type. As a temporary
> workaround, consider setting a higher override value with
> IOUtils.setByteArrayMaxOverride()
>
> To resolve the problem we set byteArrayMaxOverride in the tika-config.xml
> file as follows:
>
> type="int">2000
>
> This helped to parse some files that failed previously, but some other files
> still failed, and then we increased the value to 200 MB and 500 MB.
>
> Some other files may still fail with byteArrayMaxOverride set to 500 MB. So
> we wonder if you can add a feature to the Tika parser for it to stop reading
> metadata and body content once a certain amount of memory or body content has
> been reached. The parser would return the metadata and body content obtained
> so far, and a warning message would be returned to the caller if this
> happens. This would help us get the metadata and body content from files that
> require a lot of memory. We may not be able to successfully parse some files
> without this feature, because those files fail somewhere else with an
> out-of-memory error after we set byteArrayMaxOverride to very high values and
> the above-mentioned failure does not happen. With this feature we would get
> truncated body content for some files, but that is better than getting
> nothing. Actually, we will truncate the body content ourselves if it is too
> large, so we do not care if the body content is truncated once it reaches a
> certain amount.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (TIKA-3520) Revert rendering only non-text elements in auto mode for PDFs
Tim Allison created TIKA-3520:
-
Summary: Revert rendering only non-text elements in auto mode for PDFs
Key: TIKA-3520
URL: https://issues.apache.org/jira/browse/TIKA-3520
Project: Tika
Issue Type: Task
Reporter: Tim Allison

In Tika 2.0.0, we changed the default behavior of AUTO mode to render only the non-text elements. I now think we should revert this and render the full page, including text elements, until we can come up with a better decision process for automatically determining whether it is better to render the full page or only the non-text elements.
[jira] [Updated] (TIKA-3515) Tika CLI -t should use UTF-8 as default output encoding
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3515:
-
Affects Version/s: (was: 2.0.0-BETA) 2.0.0

> Tika CLI -t should use UTF-8 as default output encoding
> --
>
> Key: TIKA-3515
> URL: https://issues.apache.org/jira/browse/TIKA-3515
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 2.0.0, 1.27
> Environment: Windows 10, Liberica OpenJDK FULL x64 1.8.0_302
> Reporter: Luís Filipe Nassif
> Priority: Minor
> Attachments: Korean lessons_ Lesson 2 – Learnkorean.com.pdf,
> LIVE-Seoul-ntfs-utf-16-be.txt, LIVE-Seoul-ntfs-utf-16-le.txt,
> LIVE-Seoul-ntfs-utf-8.txt, LIVE-Seoul-ntfs-utf-8.txt_-x_output.xml,
> LIVE-Seoul-ntfs-utf-8_-t_output.txt, Screen Shot 2021-08-06 at 5.50.04 PM.png,
> Screen Shot 2021-08-06 at 5.50.21 PM.png, Screen Shot tika-app.png,
> image-2021-08-09-14-37-30-552.png, image-2021-08-09-14-38-26-763.png
>
> Some Korean chars are extracted as squares. The encodings of the plain-text
> files are detected correctly. Maybe this is related to the content handler
> (just a guess). I'll attach the triggering files.
[jira] [Commented] (TIKA-3515) Tika CLI -t should use UTF-8 as default output encoding
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397391#comment-17397391 ] Tim Allison commented on TIKA-3515:
-
If we're going to make this change in tika-app, I think we should also deprecate initializing WriteOutContentHandler and ToTextContentHandler with only an OutputStream, because those constructors fall back to Charset.defaultCharset(). We can also clean up the default-charset usage in some of our unit tests. I'm concerned about what might happen if we try to change them in the translators... I'll leave those alone. If anyone has objections to any of the above, let me know.
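A minimal JDK-only sketch of the pitfall behind this comment: a handler built from a bare OutputStream writes through the platform default charset, which varies by machine (e.g. a legacy encoding on Windows), so Hangul can come out as replacement squares; constructing the writer with an explicit UTF-8 charset avoids this. The class and method names here are hypothetical, for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Encode text through an explicit UTF-8 writer instead of relying on
    // Charset.defaultCharset(), whose value depends on the platform.
    public static byte[] encodeUtf8(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bos, StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = encodeUtf8("서울"); // "Seoul" in Hangul
        // Each Hangul syllable occupies 3 bytes in UTF-8.
        System.out.println(utf8.length); // 6
        // Decoding with the same explicit charset round-trips losslessly.
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // 서울
    }
}
```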
[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached
[ https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397365#comment-17397365 ] Tim Allison commented on TIKA-3519:
-
If the underlying parser (Apache POI in this case) writes content to the ContentHandler before a write-limit exception, you _should_ be able to retrieve that text and the metadata. If the underlying parser needs to parse the full file and hits this exception before writing anything to the ContentHandler, then there's not much we can do.
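The write-limit approach described in this thread can be sketched against the Tika 1.x API roughly as follows: WriteOutContentHandler takes a character write limit, and its isWriteLimitReached() method distinguishes "limit hit" from a real parse failure. This is an illustrative sketch, not tested here; the file path is a placeholder.

```java
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.xml.sax.SAXException;

public class WriteLimitExample {
    public static void main(String[] args) throws Exception {
        StringWriter writer = new StringWriter();
        // Stop collecting body text after roughly 1M characters.
        WriteOutContentHandler handler = new WriteOutContentHandler(writer, 1_000_000);
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream is = Files.newInputStream(Paths.get("/path/to/file.xls"))) {
            parser.parse(is, new BodyContentHandler(handler), metadata);
        } catch (SAXException e) {
            if (!handler.isWriteLimitReached(e)) {
                throw e; // a real parse failure, not the write limit
            }
            // Write limit hit: fall through and use the truncated text
            // plus whatever metadata was extracted before the limit.
        }
        System.out.println("chars extracted: " + writer.toString().length());
    }
}
```

Note that, as the comment above says, this only helps when the underlying parser streams content out incrementally; a failure before anything reaches the handler still yields no text.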
[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached
[ https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397345#comment-17397345 ] Xiaohong Yang commented on TIKA-3519:
-
I tried org.apache.tika.sax.WriteOutContentHandler with a writeLimit in a test program and found that this is one of the features we want. However, I noticed that this approach (setting writeLimit) does not help avoid the ByteArrayMaxOverride error mentioned in the ticket (Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 14523048, but 500 is the maximum for this record type…). I also noticed that when the ByteArrayMaxOverride error happens, we do not get any body text, regardless of the value of writeLimit. When the ByteArrayMaxOverride error happens, we can catch the exception, get the required override value from the stack trace, set it with IOUtils.setByteArrayMaxOverride(), and try the parse method again (it will probably succeed if the machine has enough memory). However, we wonder if you can add a feature so that the body text parsed so far is still available when the ByteArrayMaxOverride error happens. That would let us decide whether to retry or to use the available body text (and metadata), depending on the required override value, because a very high value may not be feasible, for example when there is not enough memory available on the machine.
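The catch-and-retry flow described in the comment above can be sketched roughly as follows. Assumptions to flag: the requested length is scraped from POI's RecordFormatException message text (fragile by nature), Tika is assumed to wrap that exception as the TikaException's cause, and the class, method, and threshold here are hypothetical illustrations.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.poi.util.IOUtils;
import org.apache.poi.util.RecordFormatException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class RetryExample {
    // Matches "Tried to allocate an array of length 14523048, but 500 is the maximum..."
    private static final Pattern LENGTH =
            Pattern.compile("allocate an array of length (\\d+)");

    public static String parseWithRetry(Path file, Metadata metadata) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        try {
            return parseOnce(parser, file, metadata);
        } catch (TikaException e) {
            Matcher m = LENGTH.matcher(String.valueOf(e.getCause()));
            if (e.getCause() instanceof RecordFormatException && m.find()) {
                int required = Integer.parseInt(m.group(1));
                // Caller policy: only retry when the requested size is feasible
                // on this machine (threshold here is an arbitrary example).
                if (required < 500_000_000) {
                    IOUtils.setByteArrayMaxOverride(required);
                    return parseOnce(parser, file, metadata);
                }
            }
            throw e;
        }
    }

    private static String parseOnce(AutoDetectParser parser, Path file, Metadata metadata)
            throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        try (InputStream is = Files.newInputStream(file)) {
            parser.parse(is, handler, metadata);
        }
        return handler.toString();
    }
}
```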