[jira] [Updated] (TIKA-4283) Add detection for JKS Keystore
[ https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4283: -- Component/s: core parser > Add detection for JKS Keystore > -- > > Key: TIKA-4283 > URL: https://issues.apache.org/jira/browse/TIKA-4283 > Project: Tika > Issue Type: New Feature > Components: core, parser >Affects Versions: 2.9.2 >Reporter: Lonzak >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > I added detection for java keystores JKS. It is based on the magic byte. > > Some additional infos: > [https://en.wikipedia.org/wiki/Java_KeyStore] > The magic bytes are described here: > [https://en.wikipedia.org/wiki/List_of_file_signatures] > > A proprietary keystore implementation provided by SUN. > [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation] > > If possible this should be added to 2.9.X Branch. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4283) Add detection for JKS Keystore
[ https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4283: -- Affects Version/s: 2.9.2 > Add detection for JKS Keystore > -- > > Key: TIKA-4283 > URL: https://issues.apache.org/jira/browse/TIKA-4283 > Project: Tika > Issue Type: New Feature >Affects Versions: 2.9.2 >Reporter: Lonzak >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > I added detection for java keystores JKS. It is based on the magic byte. > > Some additional infos: > [https://en.wikipedia.org/wiki/Java_KeyStore] > The magic bytes are described here: > [https://en.wikipedia.org/wiki/List_of_file_signatures] > > A proprietary keystore implementation provided by SUN. > [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation] > > If possible this should be added to 2.9.X Branch. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4283) Add detection for JKS Keystore
[ https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4283. --- Assignee: Tilman Hausherr Resolution: Fixed Done, it's now in 2.* as well, thanks. > Add detection for JKS Keystore > -- > > Key: TIKA-4283 > URL: https://issues.apache.org/jira/browse/TIKA-4283 > Project: Tika > Issue Type: New Feature > Components: core, parser >Affects Versions: 2.9.2 >Reporter: Lonzak >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > I added detection for java keystores JKS. It is based on the magic byte. > > Some additional infos: > [https://en.wikipedia.org/wiki/Java_KeyStore] > The magic bytes are described here: > [https://en.wikipedia.org/wiki/List_of_file_signatures] > > A proprietary keystore implementation provided by SUN. > [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation] > > If possible this should be added to 2.9.X Branch. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4283) Add detection for JKS Keystore
[ https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4283: -- Fix Version/s: 3.0.0 > Add detection for JKS Keystore > -- > > Key: TIKA-4283 > URL: https://issues.apache.org/jira/browse/TIKA-4283 > Project: Tika > Issue Type: New Feature >Reporter: Lonzak >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > I added detection for java keystores JKS. It is based on the magic byte. > > Some additional infos: > [https://en.wikipedia.org/wiki/Java_KeyStore] > The magic bytes are described here: > [https://en.wikipedia.org/wiki/List_of_file_signatures] > > A proprietary keystore implementation provided by SUN. > [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation] > > If possible this should be added to 2.9.X Branch. -- This message was sent by Atlassian Jira (v8.20.10#820010)
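The magic-byte check described in this issue can be sketched roughly as follows. This is a minimal illustration, not the code that was committed to Tika: it assumes the JKS signature is the four bytes FE ED FE ED at offset 0 (as given in the linked list of file signatures), and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, for illustration only (not part of Tika).
final class JksMagicSketch {
    // Assumed JKS magic number: bytes FE ED FE ED at the start of the file.
    static final int JKS_MAGIC = 0xFEEDFEED;

    // Returns true if the first four bytes match the JKS magic number.
    static boolean looksLikeJks(byte[] header) {
        if (header == null || header.length < 4) {
            return false;
        }
        // ByteBuffer reads big-endian by default, matching the on-disk byte order.
        return ByteBuffer.wrap(header, 0, 4).getInt() == JKS_MAGIC;
    }
}
```

A signature match only suggests the type; it does not validate the keystore contents or its integrity hash.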
[jira] [Commented] (TIKA-4285) Invalid Link for changelog CHANGES.txt files
[ https://issues.apache.org/jira/browse/TIKA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867684#comment-17867684 ]

Tilman Hausherr commented on TIKA-4285:
---

Additionally: the 3.0.0-BETA2 link works, however the text mentions "Tika 2.9.2".

> Invalid Link for changelog CHANGES.txt files
>
>
> Key: TIKA-4285
> URL: https://issues.apache.org/jira/browse/TIKA-4285
> Project: Tika
> Issue Type: Task
> Affects Versions: 2.9.0, 2.9.1, 2.9.2
> Reporter: Lonzak
> Priority: Major
>
> On the tika [start page|https://tika.apache.org/] the linked change log files
> CHANGES.txt starting with version 2.9.0 are missing/broken.
>
> {+}Working{+}:
> https://archive.apache.org/dist/tika/2.8.0/CHANGES-2.8.0.txt
> +Not working:+
> https://archive.apache.org/dist/{-}{color:#FF}release{color}{-}/tika/2.9.0/CHANGES-2.9.0.txt
> https://archive.apache.org/dist/{-}{color:#FF}release{color}{-}/tika/2.9.1/CHANGES-2.9.1.txt
> https://archive.apache.org/dist/{-}{color:#FF}release{color}{-}/tika/2.9.2/CHANGES-2.9.2.txt
[jira] [Closed] (TIKA-4284) [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13
[ https://issues.apache.org/jira/browse/TIKA-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed TIKA-4284.
-
    Resolution: Invalid

> [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and
> strudl.0.3.13
> ---
>
> Key: TIKA-4284
> URL: https://issues.apache.org/jira/browse/TIKA-4284
> Project: Tika
> Issue Type: Bug
> Reporter: Abhijit Rajwade
> Priority: Major
> Labels: SECURITY
>
> CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13
> Description :
> Severity : CVE CVSS 3: 7.5 Sonatype CVSS 3: 7.5
> Weakness : Sonatype CWE: 400
> Source : National Vulnerability Database
> Categories : Data
> Description from CVE : An issue was discovered in the stripTags and
> unescapeHTML components in Prototype 1.7.3 where an attacker can cause a
> Regular Expression Denial of Service through stripping crafted HTML tags.
> Explanation : The prototype package is vulnerable to Regular Expression
> Denial of Service [ReDoS] attacks. The stripTags[] function in the String.js
> file used to unescape HTML fails to efficiently parse and remove tags within
> a given string. An attacker can exploit this vulnerability by submitting a
> crafted code block which, when parsed by the affected function, will exhaust
> system resources and trigger a DoS condition.
> Detection : The application is vulnerable by using this component.
> Recommendation : There is no non-vulnerable upgrade path for this
> component/package. We recommend investigating alternative components or a
> potential mitigating control.
> Root Cause : activemq-osgi-5.17.6.jar org/apache/activemq/web/prototype.js :
> [ , ]
> Advisories : Attack: https://github.com/AlyxRen/prototype.node.js
> CVSS Details : CVE CVSS 3: 7.5 CVSS Vector:
> CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
> CVE : CVE-2020-27511
> URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511
> Remediation : This component does not have any non-vulnerable Version. Please
> contact the vendor to get this vulnerability fixed.
> ===
> Description :
> Severity : CVE CVSS 3: 7.5 Sonatype CVSS 3: 7.5
> Weakness : Sonatype CWE: 400
> Source : National Vulnerability Database
> Categories : Data
> Description from CVE : An issue was discovered in the stripTags and
> unescapeHTML components in Prototype 1.7.3 where an attacker can cause a
> Regular Expression Denial of Service through stripping crafted HTML tags.
> Explanation : The prototype package is vulnerable to Regular Expression
> Denial of Service [ReDoS] attacks. The stripTags[] function in the String.js
> file used to unescape HTML fails to efficiently parse and remove tags within
> a given string. An attacker can exploit this vulnerability by submitting a
> crafted code block which, when parsed by the affected function, will exhaust
> system resources and trigger a DoS condition.
> Detection : The application is vulnerable by using this component.
> Recommendation : There is no non-vulnerable upgrade path for this
> component/package. We recommend investigating alternative components or a
> potential mitigating control.
> Root Cause : strudl.0.3.13 : [ , ]
> Advisories : Attack: https://github.com/AlyxRen/prototype.node.js
> CVSS Details : CVE CVSS 3: 7.5 CVSS Vector:
> CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
> CVE : CVE-2020-27511
> URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511
> Remediation : This component does not have any non-vulnerable Version. Please
> contact the vendor to get this vulnerability fixed.
[jira] [Commented] (TIKA-4284) [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13
[ https://issues.apache.org/jira/browse/TIKA-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867236#comment-17867236 ]

Tilman Hausherr commented on TIKA-4284:
---

How is this related to Tika? What subproject uses activemq-osgi-5.17.6 and strudl.0.3.13?

> [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and
> strudl.0.3.13
> ---
>
> Key: TIKA-4284
> URL: https://issues.apache.org/jira/browse/TIKA-4284
> Project: Tika
> Issue Type: Bug
> Reporter: Abhijit Rajwade
> Priority: Major
> Labels: SECURITY
>
> CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13
> Description :
> Severity : CVE CVSS 3: 7.5 Sonatype CVSS 3: 7.5
> Weakness : Sonatype CWE: 400
> Source : National Vulnerability Database
> Categories : Data
> Description from CVE : An issue was discovered in the stripTags and
> unescapeHTML components in Prototype 1.7.3 where an attacker can cause a
> Regular Expression Denial of Service through stripping crafted HTML tags.
> Explanation : The prototype package is vulnerable to Regular Expression
> Denial of Service [ReDoS] attacks. The stripTags[] function in the String.js
> file used to unescape HTML fails to efficiently parse and remove tags within
> a given string. An attacker can exploit this vulnerability by submitting a
> crafted code block which, when parsed by the affected function, will exhaust
> system resources and trigger a DoS condition.
> Detection : The application is vulnerable by using this component.
> Recommendation : There is no non-vulnerable upgrade path for this
> component/package. We recommend investigating alternative components or a
> potential mitigating control.
> Root Cause : activemq-osgi-5.17.6.jar org/apache/activemq/web/prototype.js :
> [ , ]
> Advisories : Attack: https://github.com/AlyxRen/prototype.node.js
> CVSS Details : CVE CVSS 3: 7.5 CVSS Vector:
> CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
> CVE : CVE-2020-27511
> URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511
> Remediation : This component does not have any non-vulnerable Version. Please
> contact the vendor to get this vulnerability fixed.
> ===
> Description :
> Severity : CVE CVSS 3: 7.5 Sonatype CVSS 3: 7.5
> Weakness : Sonatype CWE: 400
> Source : National Vulnerability Database
> Categories : Data
> Description from CVE : An issue was discovered in the stripTags and
> unescapeHTML components in Prototype 1.7.3 where an attacker can cause a
> Regular Expression Denial of Service through stripping crafted HTML tags.
> Explanation : The prototype package is vulnerable to Regular Expression
> Denial of Service [ReDoS] attacks. The stripTags[] function in the String.js
> file used to unescape HTML fails to efficiently parse and remove tags within
> a given string. An attacker can exploit this vulnerability by submitting a
> crafted code block which, when parsed by the affected function, will exhaust
> system resources and trigger a DoS condition.
> Detection : The application is vulnerable by using this component.
> Recommendation : There is no non-vulnerable upgrade path for this
> component/package. We recommend investigating alternative components or a
> potential mitigating control.
> Root Cause : strudl.0.3.13 : [ , ]
> Advisories : Attack: https://github.com/AlyxRen/prototype.node.js
> CVSS Details : CVE CVSS 3: 7.5 CVSS Vector:
> CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
> CVE : CVE-2020-27511
> URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511
> Remediation : This component does not have any non-vulnerable Version. Please
> contact the vendor to get this vulnerability fixed.
[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230
[ https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4282: -- Description: The latest h2 version (which needs jdk11) brings a syntax error because of an unneeded comma in one SQL query. release notes: https://github.com/h2database/h2database/releases/tag/version-2.3.230 likely this: https://github.com/h2database/h2database/issues/3106 was: The latest h2 (which needs jdk11) version brings a syntax error because of an unneeded comma in one SQL query. release notes: https://github.com/h2database/h2database/releases/tag/version-2.3.230 likely this: https://github.com/h2database/h2database/issues/3106 > Syntax error with h2 version 2.3.230 > > > Key: TIKA-4282 > URL: https://issues.apache.org/jira/browse/TIKA-4282 > Project: Tika > Issue Type: Bug > Components: tika-eval >Affects Versions: 3.0.0-BETA, 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > > The latest h2 version (which needs jdk11) brings a syntax error because of an > unneeded comma in one SQL query. > release notes: > https://github.com/h2database/h2database/releases/tag/version-2.3.230 > likely this: > https://github.com/h2database/h2database/issues/3106 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4282) Syntax error with h2 version 2.3.230
[ https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4282. --- Resolution: Fixed > Syntax error with h2 version 2.3.230 > > > Key: TIKA-4282 > URL: https://issues.apache.org/jira/browse/TIKA-4282 > Project: Tika > Issue Type: Bug > Components: tika-eval >Affects Versions: 3.0.0-BETA, 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > > The latest h2 (which needs jdk11) version brings a syntax error because of an > unneeded comma in one SQL query. > release notes: > https://github.com/h2database/h2database/releases/tag/version-2.3.230 > likely this: > https://github.com/h2database/h2database/issues/3106 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4282) Syntax error with h2 version 2.3.230
Tilman Hausherr created TIKA-4282: - Summary: Syntax error with h2 version 2.3.230 Key: TIKA-4282 URL: https://issues.apache.org/jira/browse/TIKA-4282 Project: Tika Issue Type: Bug Components: tika-eval Affects Versions: 3.0.0-BETA Reporter: Tilman Hausherr Assignee: Tilman Hausherr Fix For: 3.0.0 The latest h2 version brings a syntax error because of an unneeded comma in one SQL query. release notes: https://github.com/h2database/h2database/releases/tag/version-2.3.230 likely this: https://github.com/h2database/h2database/issues/3106 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230
[ https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4282: -- Description: The latest h2 (which needs jdk11) version brings a syntax error because of an unneeded comma in one SQL query. release notes: https://github.com/h2database/h2database/releases/tag/version-2.3.230 likely this: https://github.com/h2database/h2database/issues/3106 was: The latest h2 version brings a syntax error because of an unneeded comma in one SQL query. release notes: https://github.com/h2database/h2database/releases/tag/version-2.3.230 likely this: https://github.com/h2database/h2database/issues/3106 > Syntax error with h2 version 2.3.230 > > > Key: TIKA-4282 > URL: https://issues.apache.org/jira/browse/TIKA-4282 > Project: Tika > Issue Type: Bug > Components: tika-eval >Affects Versions: 3.0.0-BETA > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0 > > > The latest h2 (which needs jdk11) version brings a syntax error because of an > unneeded comma in one SQL query. > release notes: > https://github.com/h2database/h2database/releases/tag/version-2.3.230 > likely this: > https://github.com/h2database/h2database/issues/3106 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230
[ https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4282: -- Affects Version/s: 2.9.2 > Syntax error with h2 version 2.3.230 > > > Key: TIKA-4282 > URL: https://issues.apache.org/jira/browse/TIKA-4282 > Project: Tika > Issue Type: Bug > Components: tika-eval >Affects Versions: 3.0.0-BETA, 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0 > > > The latest h2 (which needs jdk11) version brings a syntax error because of an > unneeded comma in one SQL query. > release notes: > https://github.com/h2database/h2database/releases/tag/version-2.3.230 > likely this: > https://github.com/h2database/h2database/issues/3106 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230
[ https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4282: -- Fix Version/s: 2.9.3 > Syntax error with h2 version 2.3.230 > > > Key: TIKA-4282 > URL: https://issues.apache.org/jira/browse/TIKA-4282 > Project: Tika > Issue Type: Bug > Components: tika-eval >Affects Versions: 3.0.0-BETA, 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > > The latest h2 (which needs jdk11) version brings a syntax error because of an > unneeded comma in one SQL query. > release notes: > https://github.com/h2database/h2database/releases/tag/version-2.3.230 > likely this: > https://github.com/h2database/h2database/issues/3106 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-1155) Number Format is converted with an error
[ https://issues.apache.org/jira/browse/TIKA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-1155. - Resolution: Cannot Reproduce Closing because it can no longer be reproduced, it has probably been fixed either by us or in POI. Please comment and/or reopen if you disagree. > Number Format is converted with an error > > > Key: TIKA-1155 > URL: https://issues.apache.org/jira/browse/TIKA-1155 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.4 >Reporter: Evgeniy Buyanov >Priority: Major > Attachments: screenshot-1.png, test-Excel.csv, test.xlsx, test.xml > > Original Estimate: 2h > Remaining Estimate: 2h > > {code:Title=Source data} > ><NumberFormat ss:Format="_-* #,##0\ _B_F_-;\-* #,##0\ _B_F_-;_-* > "-"\ _B_F_-;_-@_-"/> > > 10 > -10 > {code} > java -jar tika-app-1.4.jar test.xlsx > test.xml > {code:Title=Result} > * 10 _F > -10 _F > {code} > related ASF Bugzilla – Bug > [52592|https://issues.apache.org/bugzilla/show_bug.cgi?id=52592] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-1155) Number Format is converted with an error
[ https://issues.apache.org/jira/browse/TIKA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866408#comment-17866408 ] Tilman Hausherr commented on TIKA-1155: --- Current output: {code:xml} Sheet1 10 - 10 - text {code} Looks like this on the screen: !screenshot-1.png! > Number Format is converted with an error > > > Key: TIKA-1155 > URL: https://issues.apache.org/jira/browse/TIKA-1155 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.4 >Reporter: Evgeniy Buyanov >Priority: Major > Attachments: screenshot-1.png, test-Excel.csv, test.xlsx, test.xml > > Original Estimate: 2h > Remaining Estimate: 2h > > {code:Title=Source data} > ><NumberFormat ss:Format="_-* #,##0\ _B_F_-;\-* #,##0\ _B_F_-;_-* > "-"\ _B_F_-;_-@_-"/> > > 10 > -10 > {code} > java -jar tika-app-1.4.jar test.xlsx > test.xml > {code:Title=Result} > * 10 _F > -10 _F > {code} > related ASF Bugzilla – Bug > [52592|https://issues.apache.org/bugzilla/show_bug.cgi?id=52592] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-1155) Number Format is converted with an error
[ https://issues.apache.org/jira/browse/TIKA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1155: -- Attachment: screenshot-1.png > Number Format is converted with an error > > > Key: TIKA-1155 > URL: https://issues.apache.org/jira/browse/TIKA-1155 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.4 >Reporter: Evgeniy Buyanov >Priority: Major > Attachments: screenshot-1.png, test-Excel.csv, test.xlsx, test.xml > > Original Estimate: 2h > Remaining Estimate: 2h > > {code:Title=Source data} > ><NumberFormat ss:Format="_-* #,##0\ _B_F_-;\-* #,##0\ _B_F_-;_-* > "-"\ _B_F_-;_-@_-"/> > > 10 > -10 > {code} > java -jar tika-app-1.4.jar test.xlsx > test.xml > {code:Title=Result} > * 10 _F > -10 _F > {code} > related ASF Bugzilla – Bug > [52592|https://issues.apache.org/bugzilla/show_bug.cgi?id=52592] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-3028) Failed test at SAS7BDATParserTest:112
[ https://issues.apache.org/jira/browse/TIKA-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed TIKA-3028.
-
    Resolution: Cannot Reproduce

Closing for now because of no activity for years, please reopen if it still happens. I remember I had several problems in my early months as a committer with a German locale, and we did some fixes in the code and some configuration changes in my IDE.

> Failed test at SAS7BDATParserTest:112
> -
>
> Key: TIKA-3028
> URL: https://issues.apache.org/jira/browse/TIKA-3028
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.23
> Reporter: Wknds
> Priority: Blocker
> Attachments: Bildschirmfoto 2020-01-24 um 23.12.20.png
>
> Test fails at
> SAS7BDATParserTest.testMultiColumns:112->TikaTest.assertContains:107.
> Expected date is _01Jan1960:00:00_
> while the dates in the (untouched) test file are abbreviated with a '.' on my
> system (please refer to the terminal output below).
> {code:java}
> // code placeholder
> [ERROR] Failures:
> [ERROR] SAS7BDATParserTest.testMultiColumns:112->TikaTest.assertContains:107 01Jan1960:00:00 not found in:
> TESTING
> Record Number  Square of the Record Number  Description of the Row  Percent Done  Percent Increment  date        datetime                time
> 0              0                            This is row 0 of 10     0%                               01-01-1960  01Jan.1960:00:00:01.00  00:00:01
> 1              1                            This is row 1 of 10     10%           0.0%               02-01-1960  01Jan.1960:00:00:10.00  00:00:03
> 2              4                            This is row 2 of 10     20%           50.0%              17-01-1960  01Jan.1960:00:01:40.00  00:00:09
> 3              9                            This is row 3 of 10     30%           66.7%              22-03-1960  01Jan.1960:00:16:40.00  00:00:27
> 4              16                           This is row 4 of 10     40%           75.0%              13-09-1960  01Jan.1960:02:46:40.00  00:01:21
> 5              25                           This is row 5 of 10     50%           80.0%              17-09-1961  02Jan.1960:03:46:40.00  00:04:03
> 6              36                           This is row 6 of 10     60%           83.3%              20-07-1963  12Jan.1960:13:46:40.00  00:12:09
> 7              49                           This is row 7 of 10     70%           85.7%              29-07-1966  25Apr.1960:17:46:40.00  00:36:27
> 8              64                           This is row 8 of 10     80%           87.5%              20-03-1971  03März1963:09:46:40.00  01:49:21
> 9              81                           This is row 9 of 10     90%           88.9%              18-12-1977  09Sep.1991:01:46:40.00  05:28:03
> 10             100                          This is row 10 of 10    100%          90.0%              19-05-1987  19Nov.2276:17:46:40.00  16:24:09
> {code}
>
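The mismatch reported here appears to come from locale-dependent month abbreviations. The following is a minimal sketch of that effect, using `java.time` rather than the parser's own code; the class name, method name and pattern string are assumptions made for illustration.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical demo class, not taken from the SAS7BDAT parser.
final class LocaleDateSketch {
    // Pattern chosen to mirror the expected test string "01Jan1960:00:00".
    private static final String PATTERN = "ddMMMyyyy:HH:mm";

    // Formats the value with an explicit locale instead of the JVM default.
    static String format(LocalDateTime value, Locale locale) {
        return DateTimeFormatter.ofPattern(PATTERN, locale).format(value);
    }
}
```

Calling `LocaleDateSketch.format(LocalDateTime.of(1960, 1, 1, 0, 0), Locale.ENGLISH)` gives `01Jan1960:00:00`. With `Locale.GERMAN` the abbreviated month may carry a trailing period, depending on the JDK's locale data, which would match the `01Jan.1960` values in the report. Pinning the locale in the formatter or in the test setup avoids the dependency on the default locale.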
[jira] [Updated] (TIKA-3290) Extension reading it as eml instead of txt
[ https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-3290: -- Fix Version/s: (was: 1.24.1) > Extension reading it as eml instead of txt > -- > > Key: TIKA-3290 > URL: https://issues.apache.org/jira/browse/TIKA-3290 > Project: Tika > Issue Type: Bug > Components: core, mime >Affects Versions: 1.25 >Reporter: Tika User >Priority: Major > Labels: tika-parsers > Attachments: image-2021-02-22-10-13-08-447.png, > image-2021-02-23-12-39-00-778.png, test_sample_message.txt > > > The attached file extension is reading it as eml instead of txt. With version > 1.24.1 it is reading it as txt and now with the upgrade to 1.25, it is > reading it as eml. So that while parsing we are getting mail corrupted error. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3172) PDF Parser configuration enable auto space using tika config file
[ https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-3172. --- Fix Version/s: 1.25 Assignee: Tim Allison Resolution: Fixed > PDF Parser configuration enable auto space using tika config file > - > > Key: TIKA-3172 > URL: https://issues.apache.org/jira/browse/TIKA-3172 > Project: Tika > Issue Type: Wish > Components: parser >Affects Versions: 1.24.1 >Reporter: Akash >Assignee: Tim Allison >Priority: Major > Fix For: 1.25 > > > Need information on how to set enableAutoSpace using tika config file. > {code:java} > / > > > > > > > false > > > > / > {code} > Above configuration is not working. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-3155) Parse Error while extracting CSV files
[ https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed TIKA-3155.
-
    Resolution: Duplicate

Closing as duplicate of TIKA-4278. This isn't a CSV file by the improved logic.

> Parse Error while extracting CSV files
> --
>
> Key: TIKA-3155
> URL: https://issues.apache.org/jira/browse/TIKA-3155
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.24.1
> Reporter: Akash
> Priority: Major
> Attachments: UTF-8_chars.csv
>
> We are getting parse error while trying to extract csv files.
> This was working in version 1.9, but exception coming in 1.24.1
>
> {code:java}
> /Exception in thread "main" org.apache.tika.exception.TikaException: exception parsing the csv
>     at org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:198 undefined)
>     at org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 undefined)
>     at org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 undefined)
>     at org.apache.tikar.AutoDetectParser.parse.parse(AutoDetectParser.java:143 undefined)
>     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209 undefined)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496 undefined)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149 undefined)
> Caused by: java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
>     at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145 undefined)
>     at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155 undefined)
>     at org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:178 undefined)
>     ... 6 more
> Caused by: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
>     at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288 undefined)
>     at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158 undefined)
>     at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674 undefined)
>     at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142 undefined)/
> {code}
> Issue is coming when we encounter double quotes in one of the cells.
[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866277#comment-17866277 ]

Tilman Hausherr commented on TIKA-4278:
---

If colon and another delimiter have been detected with the same confidence, use the other one.

> TextAndCSVParser doesn't detect semicolon separated file
>
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz,
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters
> (pipe, colon and semicolon) aren't in that array, although they are in
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that
> I missed? Proposed change would change CSVSniffer so that delimiters is a set
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.
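The tie-break rule stated in this comment can be sketched as follows. This is a simplified illustration, not the `CSVSniffer` implementation: the class name, the method name and the idea of passing a map from delimiter to confidence are assumptions introduced here.

```java
import java.util.Map;

// Hypothetical sketch of the tie-break rule; not Tika's CSVSniffer.
final class DelimiterTieBreakSketch {
    // Returns the delimiter with the highest confidence.
    // If colon ties with another delimiter, the other delimiter wins.
    // Returns null when no candidates are given.
    static Character pick(Map<Character, Double> confidences) {
        Character best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Character, Double> entry : confidences.entrySet()) {
            double score = entry.getValue();
            boolean better = score > bestScore;
            // A tie only displaces the current best when that best is the colon.
            boolean tieAgainstColon = score == bestScore && best != null && best == ':';
            if (better || tieAgainstColon) {
                best = entry.getKey();
                bestScore = score;
            }
        }
        return best;
    }
}
```

With confidences of 0.5 for both `:` and `;`, the sketch returns `;` regardless of insertion order; a colon that scores strictly higher than every other candidate is still returned.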
[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated TIKA-4278:
--
    Attachment: reports_csv_2.9.2_vs_2.9.3_4.tar.xz

> TextAndCSVParser doesn't detect semicolon separated file
>
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz,
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters
> (pipe, colon and semicolon) aren't in that array, although they are in
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that
> I missed? Proposed change would change CSVSniffer so that delimiters is a set
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.
[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866147#comment-17866147 ] Tilman Hausherr edited comment on TIKA-4278 at 7/15/24 6:40 PM: I've now added a check that if the delimiter isn't in row zero then further hits later don't count. This fixes the problem that too many files are recognized as CSV that are not. Only one problem left now: false colon-separated lines. I never had any in decades, but a google search does find some SO questions, so I'll leave that there for now. We can still change it after the "big" regression tests. was (Author: tilman): I've now added a check that if the delimiter isn't in row zero then further hits later don't count. This fixes the problem that too many files are recognized as CSV that are not. Only one problem left now: colon-separated lines. I never had any in decades, but a google search does find some SO questions, so I'll leave that there for now. We can still change it after the "big" regression tests. > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. 
> Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866147#comment-17866147 ] Tilman Hausherr edited comment on TIKA-4278 at 7/15/24 6:24 PM: I've now added a check that if the delimiter isn't in row zero then further hits later don't count. This fixes the problem that too many files are recognized as CSV that are not. Only one problem left now: colon-separated lines. I never had any in decades, but a google search does find some SO questions, so I'll leave that there for now. We can still change it after the "big" regression tests. was (Author: tilman): I've now added a check that if the delimiter isn't in row zero then further hits later don't count. This fixes the problem that too many files are recognized as CSV that are not. Only one problem left now: colon-separated lines. I never had any in decades, but a google search does find some SO questions, so I'll leave that there for now. > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? 
Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866147#comment-17866147 ] Tilman Hausherr commented on TIKA-4278: --- I've now added a check that if the delimiter isn't in row zero then further hits later don't count. This fixes the problem that too many files are recognized as CSV that are not. Only one problem left now: colon-separated lines. I never had any in decades, but a google search does find some SO questions, so I'll leave that there for now. > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
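Editorial sketch of the row-zero check described in the comment above: if a candidate delimiter does not occur in the first row, later occurrences do not count. The scoring formula (fraction of rows with the same delimiter count as row zero) is an assumption chosen to keep the example short; it is not how {{CSVSniffer}} actually scores, and the class and method names are hypothetical.

```java
// Hypothetical helper, not part of Tika: illustrates the "must appear in row zero" rule.
final class RowZeroCheckSketch {

    /** Counts occurrences of the delimiter in a single row. */
    static int count(String row, char delimiter) {
        int n = 0;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) == delimiter) {
                n++;
            }
        }
        return n;
    }

    /**
     * Returns 0.0 when the delimiter is absent from the first row. Otherwise returns
     * the fraction of rows whose delimiter count equals that of the first row.
     */
    static double confidence(String text, char delimiter) {
        String[] rows = text.split("\\R"); // split on any line break
        if (rows.length == 0) {
            return 0.0;
        }
        int expected = count(rows[0], delimiter);
        if (expected == 0) {
            return 0.0; // delimiter missing from row zero: later hits are ignored
        }
        int matching = 0;
        for (String row : rows) {
            if (count(row, delimiter) == expected) {
                matching++;
            }
        }
        return (double) matching / rows.length;
    }
}
```

With this rule, a plain-text report whose first line contains no pipe gets zero confidence for the pipe delimiter even if a table further down uses pipes.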
[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4278: -- Attachment: reports_csv_2.9.2_vs_2.9.3_3.tar.xz > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866059#comment-17866059 ] Tilman Hausherr commented on TIKA-4278: --- Many files are detected as csv that are not, e.g. govdocs1/040/040251.txt govdocs1/242/242970.txt, govdocs1/001/001605.txt: now has a ":" as separator although it's obvious that it's a ",". Maybe because of TIME_HH:MM:SS?! govdocs1/346/346152.txt is considered to be pipe-separated, despite that it's a text file, although it's a table. IMHO it shouldn't "detect" something that isn't in the first line. This would also solve the problem with govdocs1/040/040251.txt . govdocs1/113/113291.txt: claims that it contains "컴컴" but it doesn't. I assume this is a different change than mine because my changes aren't related to the encoding. I'll rerun the tests with a change that returns 0 confidence in CSVSniffer when the delimiter is not in row zero. > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? 
Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4278: -- Attachment: reports_csv_2.9.2_vs_2.9.3.tar.xz > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4278. --- Fix Version/s: 3.0.0 2.9.3 Assignee: Tilman Hausherr Resolution: Fixed > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4278: -- Labels: csv csvparser (was: ) > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865884#comment-17865884 ] Tilman Hausherr commented on TIKA-4278: --- The next build did work. > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Priority: Major > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4278: -- Description: I ran the code from the attached SO issue and yes it doesn't detect semicolon separated files. The reason is this line in {{TextAndCSVParser.java}}: {code:java} private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; {code} This is later used by {{CSVSniffer}}. For some reason the other delimiters (pipe, colon and semicolon) aren't in that array, although they are in {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now it works for semicolon. Can I change this by adding the missing delimiters or was there a reason that I missed? Proposed change would change CSVSniffer so that delimiters is a set and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. was: I ran the code from the attached SO issue and yes it doesn't detect semicolon separated files. The reason is this line in {{TextAndCSVParser.java}}: {code:java} private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; {code} This is later uses by {{CSVSniffer}}. For some reason the other delimiters (pipe, colon and semicolon) aren't in that array, although they are in {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now it works for semicolon. Can I change this by adding the missing delimiters or was there a reason that I missed? Proposed change would change CSVSniffer so that delimiters is a set and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Priority: Major > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. 
The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4278: -- Description: I ran the code from the attached SO issue and yes it doesn't detect semicolon separated files. The reason is this line in {{TextAndCSVParser.java}}: {code:java} private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; {code} This is later uses by {{CSVSniffer}}. For some reason the other delimiters (pipe, colon and semicolon) aren't in that array, although they are in {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now it works for semicolon. Can I change this by adding the missing delimiters or was there a reason that I missed? Proposed change would change CSVSniffer so that delimiters is a set and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. was: I ran the code from the attached SO issue and yes it doesn't detect semicolon separated files. The reason is this line in {{TextAndCSVParser.java}}: {code:java} private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; {code} This is later uses by {{CSVSniffer}}. For some reason the other delimiters (pipe, colon and semicolon) aren't in that array, although they are in {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now it works for semicolon. Can I change this by adding the missing delimiters or was there a reason that I missed? > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Priority: Major > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. 
The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later uses by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-2833) Add a CSV/TSV detector
[ https://issues.apache.org/jira/browse/TIKA-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865679#comment-17865679 ] Tilman Hausherr commented on TIKA-2833: --- [~joshm] please create a new ticket. Alternatively use {{TextAndCSVParser}} which can detect some csv files but not all, see TIKA-4278. > Add a CSV/TSV detector > -- > > Key: TIKA-2833 > URL: https://issues.apache.org/jira/browse/TIKA-2833 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > Fix For: 1.21 > > Attachments: csv_reports.zip > > > Given initial experimentation, I think we can fairly easily add a fairly > robust CSV/TSV detector that will identify well-formed (ha!) csvs and return > the charset encoding and the delimiter. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-2833) Add a CSV/TSV detector
[ https://issues.apache.org/jira/browse/TIKA-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-2833: -- Fix Version/s: 1.21 > Add a CSV/TSV detector > -- > > Key: TIKA-2833 > URL: https://issues.apache.org/jira/browse/TIKA-2833 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > Fix For: 1.21 > > Attachments: csv_reports.zip > > > Given initial experimentation, I think we can fairly easily add a fairly > robust CSV/TSV detector that will identify well-formed (ha!) csvs and return > the charset encoding and the delimiter. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
Tilman Hausherr created TIKA-4278: - Summary: TextAndCSVParser doesn't detect semicolon separated file Key: TIKA-4278 URL: https://issues.apache.org/jira/browse/TIKA-4278 Project: Tika Issue Type: Bug Components: parser Affects Versions: 2.9.2 Reporter: Tilman Hausherr I ran the code from the attached SO issue and yes it doesn't detect semicolon separated files. The reason is this line in {{TextAndCSVParser.java}}: {code:java} private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'}; {code} This is later used by {{CSVSniffer}}. For some reason the other delimiters (pipe, colon and semicolon) aren't in that array, although they are in {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now it works for semicolon. Can I change this by adding the missing delimiters or was there a reason that I missed? -- This message was sent by Atlassian Jira (v8.20.10#820010)
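Editorial sketch of why the candidate list matters: a sniffer can only report a delimiter that is in its candidate set, so a file separated by semicolons is never recognized when only comma and tab are tried. The map below is a hypothetical stand-in for {{CHAR_TO_STRING_DELIMITER_MAP}} (its real contents may differ), and the counting heuristic is a simplification for illustration, not the logic of {{CSVSniffer}}.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: guesses a delimiter from a fixed candidate set.
final class DelimiterCandidatesSketch {

    // Assumed stand-in for CHAR_TO_STRING_DELIMITER_MAP; names and entries are illustrative.
    static final Map<Character, String> CANDIDATES = new LinkedHashMap<>();
    static {
        CANDIDATES.put(',', "comma");
        CANDIDATES.put('\t', "tab");
        CANDIDATES.put('|', "pipe");
        CANDIDATES.put(';', "semicolon");
        CANDIDATES.put(':', "colon");
    }

    /** Returns the candidate that occurs most often in the first line, or null if none occurs. */
    static Character guessDelimiter(String text) {
        int end = text.indexOf('\n');
        String firstLine = end < 0 ? text : text.substring(0, end);
        Character best = null;
        int bestCount = 0;
        // Iterating over the key set means every delimiter in the map is tried.
        for (char candidate : CANDIDATES.keySet()) {
            int count = 0;
            for (int i = 0; i < firstLine.length(); i++) {
                if (firstLine.charAt(i) == candidate) {
                    count++;
                }
            }
            if (count > bestCount) {
                best = candidate;
                bestCount = count;
            }
        }
        return best;
    }
}
```

Driving the loop from the map's key set, rather than from a separate array, keeps the candidate list and the delimiter names from drifting apart, which is the idea behind passing {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.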
Re: [VOTE] Release Apache Tika 3.0.0-BETA2 Candidate #1
+1 built on windows 10 jdk11

Before releasing the real 3.0.0 we need to remove any "-M" dependencies (I've added these so we support these other projects by testing them), and decide about the ffmpeg issue and the hdf5 issue.

Tilman

On 12.07.2024 18:08, Tim Allison wrote:
> A candidate for the Tika 3.0.0-BETA2 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/3.0.0-BETA2
>
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/3.0.0-BETA2-rc1/
>
> The SHA-512 checksum of the archive is
> 8a4142f61110f196c550146637994d26f66d6c798fc9e1d18dcadcb8a8fe817a52f59f3a03341809131f59b644fa2e183212bdee5f292d3d603d1a5a893c6848.
>
> In addition, a staged maven repository is available here:
> https://repository.apache.org/content/repositories/orgapachetika-1105/org/apache/tika
>
> Please vote on releasing this package as Apache Tika 3.0.0-BETA2.
> The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 3.0.0-BETA2
> [ ] -1 Do not release this package because...
>
> Here's my +1.
>
> Thank you, all!
>
> Best,
> Tim
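Editorial sketch for voters who want to check the SHA-512 checksum quoted above from Java rather than with a command-line tool. It uses only the JDK's MessageDigest; the class and method names are hypothetical, and the path of the downloaded archive is left to the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes SHA-512 digests as lowercase hex strings.
final class Sha512Sketch {

    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-512");
        } catch (NoSuchAlgorithmException e) {
            // SHA-512 is expected to be available on standard JDKs.
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Digest of an in-memory byte array. */
    static String sha512Hex(byte[] data) {
        return toHex(newDigest().digest(data));
    }

    /** Digest of a file, read in chunks so large archives need not fit in memory. */
    static String sha512Hex(Path file) throws IOException {
        MessageDigest digest = newDigest();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return toHex(digest.digest());
    }

    /** Compares a computed digest with a published one, ignoring case and surrounding whitespace. */
    static boolean matches(String computedHex, String publishedHex) {
        return computedHex.equalsIgnoreCase(publishedHex.trim());
    }
}
```

A voter would call sha512Hex on the downloaded source archive and pass the result to matches together with the checksum from the vote email.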
[jira] [Commented] (TIKA-4277) PDF parse issue for text rotated
[ https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865340#comment-17865340 ] Tilman Hausherr commented on TIKA-4277: --- Done. > PDF parse issue for text rotated > > > Key: TIKA-4277 > URL: https://issues.apache.org/jira/browse/TIKA-4277 > Project: Tika > Issue Type: Bug > Components: tika-app, tika-server >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: ragebear >Priority: Major > Labels: config.xml > Attachments: OtherPDFReader.png, sample2.pdf > > > the incorrect result parsed by Tika and Tika Server 2.9.2 and 3.0beta > The attached PDF cannot be correctly parsed by Tika 2.9.2 and 3.0beta, in > server version and the standalone. > if the text is rotated 90. The parsed result will have a line break after > each letter of word. It happened to symbol, English letters, and CJK > characters. > In the server version, curl -g -T "sample2.pdf" > [http://localhost:889/tika] > --header "Accept: text/plain" > In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" > --text > Both of above, deliver the incorrect result in the attached pdf. > The output result is below > i > n > s > e > r > t > > t > e > x > t > > p > r > o > b > l > e > m > insert text problem -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4277) PDF parse issue for text rotated
[ https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4277: -- Labels: config.xml (was: ) > PDF parse issue for text rotated > > > Key: TIKA-4277 > URL: https://issues.apache.org/jira/browse/TIKA-4277 > Project: Tika > Issue Type: Bug > Components: tika-app, tika-server >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: ragebear >Priority: Major > Labels: config.xml > Attachments: OtherPDFReader.png, sample2.pdf > > > the incorrect result parsed by Tika and Tika Server 2.9.2 and 3.0beta > The attached PDF cannot be correctly parsed by Tika 2.9.2 and 3.0beta, in > server version and the standalone. > if the text is rotated 90. The parsed result will have a line break after > each letter of word. It happened to symbol, English letters, and CJK > characters. > In the server version, curl -g -T "sample2.pdf" > [http://localhost:889/tika] > --header "Accept: text/plain" > In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" > --text > Both of above, deliver the incorrect result in the attached pdf. > The output result is below > i > n > s > e > r > t > > t > e > x > t > > p > r > o > b > l > e > m > insert text problem -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4277) PDF parse issue for text rotated
[ https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-4277. - Resolution: Duplicate > PDF parse issue for text rotated > > > Key: TIKA-4277 > URL: https://issues.apache.org/jira/browse/TIKA-4277 > Project: Tika > Issue Type: Bug > Components: tika-app, tika-server >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: ragebear >Priority: Major > Labels: config.xml > Attachments: OtherPDFReader.png, sample2.pdf > > > the incorrect result parsed by Tika and Tika Server 2.9.2 and 3.0beta > The attached PDF cannot be correctly parsed by Tika 2.9.2 and 3.0beta, in > server version and the standalone. > if the text is rotated 90. The parsed result will have a line break after > each letter of word. It happened to symbol, English letters, and CJK > characters. > In the server version, curl -g -T "sample2.pdf" > [http://localhost:889/tika] > --header "Accept: text/plain" > In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" > --text > Both of above, deliver the incorrect result in the attached pdf. > The output result is below > i > n > s > e > r > t > > t > e > x > t > > p > r > o > b > l > e > m > insert text problem -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4277) PDF parse issue for text rotated
[ https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865297#comment-17865297 ] Tilman Hausherr commented on TIKA-4277: --- To see what parameters are available and how to use them, do this: {noformat} java -jar tika-app-VERSION.jar --config=config.xml --dump-current-config {noformat} I get this: {code:xml} true 0.3 true true 2.5 true true false true true false false false false false true false NONE 10 536870912 300 png 1.0 GRAY ALL AUTO 10,10 false false true 0.5 false false {code} > PDF parse issue for text rotated > > > Key: TIKA-4277 > URL: https://issues.apache.org/jira/browse/TIKA-4277 > Project: Tika > Issue Type: Bug > Components: tika-app, tika-server >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: ragebear >Priority: Major > Attachments: OtherPDFReader.png, sample2.pdf > > > the incorrect result parsed by Tika and Tika Server 2.9.2 and 3.0beta > The attached PDF cannot be correctly parsed by Tika 2.9.2 and 3.0beta, in > server version and the standalone. > if the text is rotated 90. The parsed result will have a line break after > each letter of word. It happened to symbol, English letters, and JCK > characters. > In the server version, curl -g -T "sample2.pdf" > [http://localhost:889/tika] > --header "Accept: text/plain" > In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" > --text > Both of above, deliver the the incorrect result in the attached pdf. > The output result is below > i > n > s > e > r > t > > t > e > x > t > > p > r > o > b > l > e > m > insert text problem -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4276) Tika fails to detect damaged pdf
[ https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-4276. - Resolution: Not A Bug > Tika fails to detect damaged pdf > > > Key: TIKA-4276 > URL: https://issues.apache.org/jira/browse/TIKA-4276 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Xiaohong Yang >Priority: Major
>
> We use Tika to check file type and extension. However, with some damaged pdf files Tika detects them as text file.
> Wonder if you can make Tika detect the damaged pdf file as pdf file type and extension.
> Following is the sample code and the link to the tika-config.xml and the sample PDF file is [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es]
> The operating system is Ubuntu 20.04. Java version is 21. Tika version is 2.9.2 and POI version is 5.2.3.
>
> {code:java}
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.mime.MimeType;
>
> import java.io.FileInputStream;
>
> public class DetectDamagedPDF {
>
>     public static void main(String args[]) {
>         try {
>             String filePath = "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf";
>             TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml");
>             Detector detector = config.getDetector();
>             Metadata metadata = new Metadata();
>             FileInputStream fis = new FileInputStream(filePath);
>             TikaInputStream stream = TikaInputStream.get(fis);
>             metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath);
>             MediaType mediaType = detector.detect(stream, metadata);
>             MimeType mimeType = config.getMimeRepository().forName(mediaType.toString());
>             String tikaExtension = mimeType.getExtension();
>             System.out.println("tikaExtension = " + tikaExtension);
>         }
>         catch(Exception ex) {
>             ex.printStackTrace();
>         }
>     }
> }
> {code}
>
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4277) PDF parse issue for text rotated
[ https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865143#comment-17865143 ] Tilman Hausherr edited comment on TIKA-4277 at 7/11/24 7:10 PM: You should add / integrate something like this: {code:xml} true {code} was (Author: tilman): You should add / integrate something like this: {code:xml} true {code} > PDF parse issue for text rotated > > > Key: TIKA-4277 > URL: https://issues.apache.org/jira/browse/TIKA-4277 > Project: Tika > Issue Type: Bug > Components: tika-app, tika-server >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: ragebear >Priority: Major > Attachments: OtherPDFReader.png, sample2.pdf > > > Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result. > The attached PDF cannot be parsed correctly by Tika 2.9.2 and 3.0beta, in > either the server version or the standalone version. > If the text is rotated 90 degrees, the parsed result has a line break after > each letter of a word. This happens with symbols, English letters, and CJK > characters. > In the server version: curl -g -T "sample2.pdf" > [http://localhost:889/tika] > --header "Accept: text/plain" > In the standalone version: java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" > --text > Both of the above deliver the incorrect result for the attached PDF. > The output is below: > i > n > s > e > r > t > > t > e > x > t > > p > r > o > b > l > e > m > insert text problem -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4277) PDF parse issue for text rotated
[ https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865143#comment-17865143 ] Tilman Hausherr edited comment on TIKA-4277 at 7/11/24 7:00 PM: You should add / integrate something like this: {code:xml} true {code} was (Author: tilman): You should add / integrate something like this: {code:xml} true {code} > PDF parse issue for text rotated > > > Key: TIKA-4277 > URL: https://issues.apache.org/jira/browse/TIKA-4277 > Project: Tika > Issue Type: Bug > Components: tika-app, tika-server >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: ragebear >Priority: Major > Attachments: OtherPDFReader.png, sample2.pdf > > > Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result. > The attached PDF cannot be parsed correctly by Tika 2.9.2 and 3.0beta, in > either the server version or the standalone version. > If the text is rotated 90 degrees, the parsed result has a line break after > each letter of a word. This happens with symbols, English letters, and CJK > characters. > In the server version: curl -g -T "sample2.pdf" > [http://localhost:889/tika] > --header "Accept: text/plain" > In the standalone version: java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" > --text > Both of the above deliver the incorrect result for the attached PDF. > The output is below: > i > n > s > e > r > t > > t > e > x > t > > p > r > o > b > l > e > m > insert text problem -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4277) PDF parse issue for text rotated
[ https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865143#comment-17865143 ] Tilman Hausherr edited comment on TIKA-4277 at 7/11/24 3:53 PM: You should add / integrate something like this: {code:xml} true {code} was (Author: tilman): You should add / integrate something like this: {code:xml} true {code} > PDF parse issue for text rotated > > > Key: TIKA-4277 > URL: https://issues.apache.org/jira/browse/TIKA-4277 > Project: Tika > Issue Type: Bug > Components: tika-app, tika-server >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: ragebear >Priority: Major > Attachments: OtherPDFReader.png, sample2.pdf > > > Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result. > The attached PDF cannot be parsed correctly by Tika 2.9.2 and 3.0beta, in > either the server version or the standalone version. > If the text is rotated 90 degrees, the parsed result has a line break after > each letter of a word. This happens with symbols, English letters, and CJK > characters. > In the server version: curl -g -T "sample2.pdf" > [http://localhost:889/tika] > --header "Accept: text/plain" > In the standalone version: java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" > --text > Both of the above deliver the incorrect result for the attached PDF. > The output is below: > i > n > s > e > r > t > > t > e > x > t > > p > r > o > b > l > e > m > insert text problem -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4277) PDF parse issue for text rotated
[ https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865143#comment-17865143 ] Tilman Hausherr commented on TIKA-4277: --- You should add / integrate something like this: {code:xml} true {code} > PDF parse issue for text rotated > > > Key: TIKA-4277 > URL: https://issues.apache.org/jira/browse/TIKA-4277 > Project: Tika > Issue Type: Bug > Components: tika-app, tika-server >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: ragebear >Priority: Major > Attachments: OtherPDFReader.png, sample2.pdf > > > Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result. > The attached PDF cannot be parsed correctly by Tika 2.9.2 and 3.0beta, in > either the server version or the standalone version. > If the text is rotated 90 degrees, the parsed result has a line break after > each letter of a word. This happens with symbols, English letters, and CJK > characters. > In the server version: curl -g -T "sample2.pdf" > [http://localhost:889/tika] > --header "Accept: text/plain" > In the standalone version: java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" > --text > Both of the above deliver the incorrect result for the attached PDF. > The output is below: > i > n > s > e > r > t > > t > e > x > t > > p > r > o > b > l > e > m > insert text problem -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4277) PDF parse issue for text rotated
[ https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865142#comment-17865142 ] Tilman Hausherr edited comment on TIKA-4277 at 7/11/24 3:48 PM: Please attach your config.xml, or are you using default settings? was (Author: tilman): Please attach your config.xml > PDF parse issue for text rotated > > > Key: TIKA-4277 > URL: https://issues.apache.org/jira/browse/TIKA-4277 > Project: Tika > Issue Type: Bug > Components: tika-app, tika-server >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: ragebear >Priority: Major > Attachments: OtherPDFReader.png, sample2.pdf > > > Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result. > The attached PDF cannot be parsed correctly by Tika 2.9.2 and 3.0beta, in > either the server version or the standalone version. > If the text is rotated 90 degrees, the parsed result has a line break after > each letter of a word. This happens with symbols, English letters, and CJK > characters. > In the server version: curl -g -T "sample2.pdf" > [http://localhost:889/tika] > --header "Accept: text/plain" > In the standalone version: java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" > --text > Both of the above deliver the incorrect result for the attached PDF. > The output is below: > i > n > s > e > r > t > > t > e > x > t > > p > r > o > b > l > e > m > insert text problem -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4277) PDF parse issue for text rotated
[ https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865142#comment-17865142 ] Tilman Hausherr commented on TIKA-4277: --- Please attach your config.xml > PDF parse issue for text rotated > > > Key: TIKA-4277 > URL: https://issues.apache.org/jira/browse/TIKA-4277 > Project: Tika > Issue Type: Bug > Components: tika-app, tika-server >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: ragebear >Priority: Major > Attachments: OtherPDFReader.png, sample2.pdf > > > Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result. > The attached PDF cannot be parsed correctly by Tika 2.9.2 and 3.0beta, in > either the server version or the standalone version. > If the text is rotated 90 degrees, the parsed result has a line break after > each letter of a word. This happens with symbols, English letters, and CJK > characters. > In the server version: curl -g -T "sample2.pdf" > [http://localhost:889/tika] > --header "Accept: text/plain" > In the standalone version: java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" > --text > Both of the above deliver the incorrect result for the attached PDF. > The output is below: > i > n > s > e > r > t > > t > e > x > t > > p > r > o > b > l > e > m > insert text problem -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-723) Rotated text isn't extracted correctly from PDFs
[ https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-723. Resolution: Duplicate Duplicate of TIKA-2779 > Rotated text isn't extracted correctly from PDFs > > > Key: TIKA-723 > URL: https://issues.apache.org/jira/browse/TIKA-723 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Michael McCandless >Priority: Minor > Attachments: rotated.pdf > > > I have an example PDF with 90 degree rotation; Tika produces the > characters one per line. I.e., the doc has "Some rotated text, > here!" but Tika produces this: > {noformat} > So > m > e > > r > o > t > a > t > e > d > > t > e > x > t > , > > h > e > r > e > ! > {noformat} > I'm able to copy/paste the text out correctly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
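One plausible way to see why rotated text can come out one character per line is sketched below. It assumes a naive extractor that starts a new output line whenever a glyph's y coordinate differs from the previous glyph's; this is an illustrative model only, not the algorithm used by Tika or PDFBox, and every name in it (`RotatedTextDemo`, `Glyph`, `layout`, `naiveExtract`, `angleAwareExtract`) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only; not Tika's or PDFBox's text extraction. */
final class RotatedTextDemo {

    /** A character together with the page position of its origin. */
    static final class Glyph {
        final char c;
        final double x;
        final double y;

        Glyph(char c, double x, double y) {
            this.c = c;
            this.x = x;
            this.y = y;
        }
    }

    /** Places the characters of text along a baseline rotated by angleDegrees. */
    static List<Glyph> layout(String text, double angleDegrees) {
        double rad = Math.toRadians(angleDegrees);
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            double advance = i * 10.0; // assumed fixed advance of 10 units per glyph
            glyphs.add(new Glyph(text.charAt(i), advance * Math.cos(rad), advance * Math.sin(rad)));
        }
        return glyphs;
    }

    /** Naive rule: emit a line break whenever y changes noticeably between glyphs. */
    static String naiveExtract(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null && Math.abs(g.y - prev.y) > 0.5) {
                sb.append('\n');
            }
            sb.append(g.c);
            prev = g;
        }
        return sb.toString();
    }

    /** Rotates positions back by the known text angle, then applies the naive rule. */
    static String angleAwareExtract(List<Glyph> glyphs, double angleDegrees) {
        double rad = Math.toRadians(-angleDegrees);
        double cos = Math.cos(rad);
        double sin = Math.sin(rad);
        List<Glyph> unrotated = new ArrayList<>();
        for (Glyph g : glyphs) {
            unrotated.add(new Glyph(g.c, g.x * cos - g.y * sin, g.x * sin + g.y * cos));
        }
        return naiveExtract(unrotated);
    }

    public static void main(String[] args) {
        List<Glyph> rotated = layout("text", 90.0);
        System.out.println(naiveExtract(rotated));
        System.out.println(angleAwareExtract(rotated, 90.0));
    }
}
```

In this model, text at 90 degrees gives every glyph a different y, so the naive rule breaks after each character, while compensating for the angle first keeps the word on one line.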
[jira] [Updated] (TIKA-4276) Tika fails to detect damaged pdf
[ https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4276: -- Description: We use Tika to check file type and extension. However, with some damaged pdf files Tika detects them as text file. Wonder if you can make Tika detect the damaged pdf file as pdf file type and extension. Following is the sample code and the link to the tika-config.xml and the sample PDF file is [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es] The operating system is Ubuntu 20.04. Java version is 21. Tika version is 2.9.2 and POI version is 5.2.3. {code:java} import org.apache.tika.config.TikaConfig; import org.apache.tika.detect.Detector; import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.metadata.TikaCoreProperties; import org.apache.tika.mime.MediaType; import org.apache.tika.mime.MimeType; import java.io.FileInputStream; public class DetectDamagedPDF { public static void main(String args[]) { try { String filePath = "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf"; TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml"); Detector detector = config.getDetector(); Metadata metadata = new Metadata(); FileInputStream fis = new FileInputStream(filePath); TikaInputStream stream = TikaInputStream.get(fis); metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath); MediaType mediaType = detector.detect(stream, metadata); MimeType mimeType = config.getMimeRepository().forName(mediaType.toString()); String tikaExtension = mimeType.getExtension(); System.out.println("tikaExtension = " + tikaExtension); } catch(Exception ex) { ex.printStackTrace(); } } } {code} was: We use Tika to check file type and extension. However, with some damaged pdf files Tika detects them as text file. Wonder if you can make Tika detect the damaged pdf file as pdf file type and extension. 
Following is the sample code and the link to the tika-config.xml and the sample PDF file is [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es] The operating system is Ubuntu 20.04. Java version is 21. Tika version is 2.9.2 and POI version is 5.2.3. import org.apache.tika.config.TikaConfig; import org.apache.tika.detect.Detector; import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.metadata.TikaCoreProperties; import org.apache.tika.mime.MediaType; import org.apache.tika.mime.MimeType; import java.io.FileInputStream; public class DetectDamagedPDF { public static void main(String args[]) { try { String filePath = "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf"; TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml"); Detector detector = config.getDetector(); Metadata metadata = new Metadata(); FileInputStream fis = new FileInputStream(filePath); TikaInputStream stream = TikaInputStream.get(fis); metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath); MediaType mediaType = detector.detect(stream, metadata); MimeType mimeType = config.getMimeRepository().forName(mediaType.toString()); String tikaExtension = mimeType.getExtension(); System.out.println("tikaExtension = " + tikaExtension); } catch(Exception ex) { ex.printStackTrace(); } } } > Tika fails to detect damaged pdf > > > Key: TIKA-4276 > URL: https://issues.apache.org/jira/browse/TIKA-4276 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Xiaohong Yang >Priority: Major > > We use Tika to check file type and extension. However, with some damaged pdf > files Tika detects them as text file. > Wonder if you can make Tika detect the damaged pdf file as pdf file type and > extension. 
> Following is the sample code and the link to the tika-config.xml and the > sample PDF file is > [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es] > The operating system is Ubuntu 20.04. Java version is 21. Tika version is > 2.9.2 and POI version is 5.2.3. > > > {code:java} > import org.apache.tika.config.TikaConfig; > import org.apache.tika.detect.Detector; > import org.apache.tika.io.TikaInputStream; > impo
[jira] [Commented] (TIKA-4276) Tika fails to detect damaged pdf
[ https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17864670#comment-17864670 ] Tilman Hausherr commented on TIKA-4276: --- Your file starts with "1 0 obj" instead of with "%PDF" so I'd say this isn't a bug. The file is truncated at the beginning, and it could be truncated anywhere. We'd need countless magic numbers. > Tika fails to detect damaged pdf > > > Key: TIKA-4276 > URL: https://issues.apache.org/jira/browse/TIKA-4276 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Xiaohong Yang >Priority: Major > > We use Tika to check file type and extension. However, with some damaged pdf > files Tika detects them as text file. > Wonder if you can make Tika detect the damaged pdf file as pdf file type and > extension. > Following is the sample code and the link to the tika-config.xml and the > sample PDF file is > [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es] > The operating system is Ubuntu 20.04. Java version is 21. Tika version is > 2.9.2 and POI version is 5.2.3. 
> > > import org.apache.tika.config.TikaConfig; > import org.apache.tika.detect.Detector; > import org.apache.tika.io.TikaInputStream; > import org.apache.tika.metadata.Metadata; > import org.apache.tika.metadata.TikaCoreProperties; > import org.apache.tika.mime.MediaType; > import org.apache.tika.mime.MimeType; > > import java.io.FileInputStream; > > public class DetectDamagedPDF { > > public static void main(String args[]) { > try > { String filePath = > "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf"; > TikaConfig config = new > TikaConfig("/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml"); > Detector detector = config.getDetector(); Metadata > metadata = new Metadata(); FileInputStream fis = new > FileInputStream(filePath); TikaInputStream stream = > TikaInputStream.get(fis); > metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath); > MediaType mediaType = detector.detect(stream, metadata); MimeType > mimeType = config.getMimeRepository().forName(mediaType.toString()); > String tikaExtension = mimeType.getExtension(); > System.out.println("tikaExtension = " + tikaExtension); } > catch(Exception ex) > { ex.printStackTrace(); } > } > } > -- This message was sent by Atlassian Jira (v8.20.10#820010)
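The point made in the comment above can be illustrated with a minimal sketch. It assumes a check that only looks for the `%PDF` marker at the very start of the data; this is a simplification for illustration, not Tika's actual detection logic, and the names `PdfHeaderCheck` and `startsWithPdfMarker` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative only: a naive magic-byte check, not Tika's detector. */
final class PdfHeaderCheck {

    // The marker mentioned in the comment above.
    private static final byte[] PDF_MARKER = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data begins with the "%PDF" marker. */
    static boolean startsWithPdfMarker(byte[] data) {
        if (data.length < PDF_MARKER.length) {
            return false;
        }
        // Compare only the leading bytes against the marker.
        return Arrays.equals(Arrays.copyOf(data, PDF_MARKER.length), PDF_MARKER);
    }

    public static void main(String[] args) {
        byte[] intact = "%PDF-1.7\n1 0 obj\n".getBytes(StandardCharsets.US_ASCII);
        // Mimics the reported file, whose first bytes are "1 0 obj".
        byte[] truncated = "1 0 obj\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("intact:    " + startsWithPdfMarker(intact));
        System.out.println("truncated: " + startsWithPdfMarker(truncated));
    }
}
```

A file that has lost its first bytes no longer matches, and because the cut could fall anywhere, no single alternative marker would cover all such cases.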
[jira] [Resolved] (TIKA-4274) Improve ExtractReaderException
[ https://issues.apache.org/jira/browse/TIKA-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4274. --- Resolution: Fixed > Improve ExtractReaderException > -- > > Key: TIKA-4274 > URL: https://issues.apache.org/jira/browse/TIKA-4274 > Project: Tika > Issue Type: Improvement > Components: tika-eval >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > > I saw this stack trace in the eval log and it's not really helpful > {noformat} > org.apache.tika.eval.app.io.ExtractReaderException > at > org.apache.tika.eval.app.io.ExtractReader.loadExtract(ExtractReader.java:125) > at > org.apache.tika.eval.app.ExtractComparer.compareFiles(ExtractComparer.java:198) > at > org.apache.tika.eval.app.ExtractComparer.processFileResource(ExtractComparer.java:180) > at > org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:152) > at > org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:87) > at > org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > {noformat} > so I'm adding the type, the cause and also some logging for > EXTRACT_FILE_TOO_SHORT / EXTRACT_FILE_TOO_LONG so that we can know what this > is about, and then do something (or not) about it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4274) Improve ExtractReaderException
[ https://issues.apache.org/jira/browse/TIKA-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17863552#comment-17863552 ] Tilman Hausherr commented on TIKA-4274: --- new output: {noformat} INFO [pool-3-thread-4] 11:41:41,973 org.apache.tika.eval.app.io.ExtractReader maxExtractLength 200 > IGNORE_LENGTH -1 and length 2587452 > maxExtractLength 200 org.apache.tika.eval.app.io.ExtractReaderException: EXTRACT_FILE_TOO_LONG at org.apache.tika.eval.app.io.ExtractReader.loadExtract(ExtractReader.java:129) at org.apache.tika.eval.app.ExtractComparer.compareFiles(ExtractComparer.java:198) at org.apache.tika.eval.app.ExtractComparer.processFileResource(ExtractComparer.java:180) at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:152) at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:87) at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) {noformat} > Improve ExtractReaderException > -- > > Key: TIKA-4274 > URL: https://issues.apache.org/jira/browse/TIKA-4274 > Project: Tika > Issue Type: Improvement > Components: tika-eval >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > > I saw this stack trace in the eval log and it's not really helpful > {noformat} > org.apache.tika.eval.app.io.ExtractReaderException > at > org.apache.tika.eval.app.io.ExtractReader.loadExtract(ExtractReader.java:125) > at > 
org.apache.tika.eval.app.ExtractComparer.compareFiles(ExtractComparer.java:198) > at > org.apache.tika.eval.app.ExtractComparer.processFileResource(ExtractComparer.java:180) > at > org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:152) > at > org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:87) > at > org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > {noformat} > so I'm adding the type, the cause and also some logging for > EXTRACT_FILE_TOO_SHORT / EXTRACT_FILE_TOO_LONG so that we can know what this > is about, and then do something (or not) about it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
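A minimal sketch of the idea described in this issue: an exception that carries a type, so the first line of the stack trace names the reason. Only the constant names EXTRACT_FILE_TOO_SHORT and EXTRACT_FILE_TOO_LONG come from the issue text; the class `TypedExtractException`, its `Type` enum, and the `checkLength` helper are hypothetical and are not the tika-eval implementation.

```java
/** Hypothetical stand-in; not the real tika-eval ExtractReaderException. */
final class TypedExtractException extends Exception {

    /** Reasons named in the issue text. */
    enum Type { EXTRACT_FILE_TOO_SHORT, EXTRACT_FILE_TOO_LONG }

    private final Type type;

    TypedExtractException(Type type) {
        // Using the type name as the message puts it on the first line of the stack trace.
        super(type.name());
        this.type = type;
    }

    TypedExtractException(Type type, Throwable cause) {
        // Keeping the cause preserves the underlying failure in the trace.
        super(type.name(), cause);
        this.type = type;
    }

    Type getType() {
        return type;
    }

    /** Hypothetical length check; in this sketch a negative limit means "no limit". */
    static void checkLength(long length, long minLength, long maxLength)
            throws TypedExtractException {
        if (minLength >= 0 && length < minLength) {
            throw new TypedExtractException(Type.EXTRACT_FILE_TOO_SHORT);
        }
        if (maxLength >= 0 && length > maxLength) {
            throw new TypedExtractException(Type.EXTRACT_FILE_TOO_LONG);
        }
    }

    public static void main(String[] args) {
        try {
            // Values taken from the log line quoted above: length 2587452, max 200.
            checkLength(2587452L, -1L, 200L);
        } catch (TypedExtractException e) {
            System.out.println(e);
        }
    }
}
```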
[jira] [Created] (TIKA-4274) Improve ExtractReaderException
Tilman Hausherr created TIKA-4274: - Summary: Improve ExtractReaderException Key: TIKA-4274 URL: https://issues.apache.org/jira/browse/TIKA-4274 Project: Tika Issue Type: Improvement Components: tika-eval Affects Versions: 2.9.2 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Fix For: 3.0.0, 2.9.3 I saw this stack trace in the eval log and it's not really helpful {noformat} org.apache.tika.eval.app.io.ExtractReaderException at org.apache.tika.eval.app.io.ExtractReader.loadExtract(ExtractReader.java:125) at org.apache.tika.eval.app.ExtractComparer.compareFiles(ExtractComparer.java:198) at org.apache.tika.eval.app.ExtractComparer.processFileResource(ExtractComparer.java:180) at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:152) at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:87) at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) {noformat} so I'm adding the type, the cause and also some logging for EXTRACT_FILE_TOO_SHORT / EXTRACT_FILE_TOO_LONG so that we can know what this is about, and then do something (or not) about it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
jdk22 build
I've set up a jdk22 build (renamed an older one). However, some tests fail, so I've opened https://issues.apache.org/jira/browse/INFRA-25943 Tilman
Re: 3.0.0-BETA2 next week?
Hi, Sure... there's currently a CVE problem with tika-dl (Deep Learning) related to ffmpeg version "6.1.1-1.5.10". I got rid of it by excluding ffmpeg and the tests still work. Is tika-dl meant to use videos too? Apparently yes: https://github.com/apache/tika/pull/165 Tilman On 03.07.2024 22:03, Tim Allison wrote: All, I think it is time to go for a 3.0.0-BETA2. What do you think about cutting that release this Friday or maybe next week? Best, Tim
[jira] [Commented] (TIKA-4181) Tika Grpc Server using Tika Pipes
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17861555#comment-17861555 ] Tilman Hausherr commented on TIKA-4181: --- PR 1849 has now succeeded. > Tika Grpc Server using Tika Pipes > - > > Key: TIKA-4181 > URL: https://issues.apache.org/jira/browse/TIKA-4181 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > Attachments: image-2024-02-06-07-54-50-116.png > > > Create a Tika Grpc server. > You should be able to create Tika Pipes fetchers, then use those fetchers. > You can then use those fetchers to FetchAndParse in 3 ways: > * synchronous fashion - you send a single request to fetch a file, and get a > single FetchAndParse response tuple. > * streaming output - you send a single request and stream back the > FetchAndParse response tuple. > * bi-directional streaming - You stream in 1 or more Fetch requests and > stream back FetchAndParse response tuples. > This requires that we create a service contract that specifies the inputs we require > from each method. > Then we will need to implement the different components with a grpc client > generated using the contract. > This would enable developers to run tika-pipes as a persistently running > daemon instead of just a single batch app, because it can continue to stream > out more inputs. > !image-2024-02-06-07-54-50-116.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4181) Tika Grpc Server using Tika Pipes
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17861363#comment-17861363 ] Tilman Hausherr commented on TIKA-4181: --- As a first step I've updated protobuf to current in the grpc subproject and excluded a dependency. We'll see what else will succeed. If there's anything that stops working but isn't shown by the tests please revert and add a comment in the pom.xml. > Tika Grpc Server using Tika Pipes > - > > Key: TIKA-4181 > URL: https://issues.apache.org/jira/browse/TIKA-4181 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > Attachments: image-2024-02-06-07-54-50-116.png > > > Create a Tika Grpc server. > You should be able to create Tika Pipes fetchers, then use those fetchers. > You can then use those fetchers to FetchAndParse in 3 ways: > * synchronous fashion - you send a single request to fetch a file, and get a > single FetchAndParse response tuple. > * streaming output - you send a single request and stream back the > FetchAndParse response tuple. > * bi-directional streaming - You stream in 1 or more Fetch requests and > stream back FetchAndParse response tuples. > This requires that we create a service contract that specifies the inputs we require > from each method. > Then we will need to implement the different components with a grpc client > generated using the contract. > This would enable developers to run tika-pipes as a persistently running > daemon instead of just a single batch app, because it can continue to stream > out more inputs. > !image-2024-02-06-07-54-50-116.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4181) Tika Grpc Server using Tika Pipes
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17861075#comment-17861075 ] Tilman Hausherr edited comment on TIKA-4181 at 7/1/24 7:02 AM: --- Is this {code:xml} 3.24.0 3.24.0 {code} intended? This is an older version of protobuf than the one we're using. It's also preventing this PR from working: https://github.com/apache/tika/pull/1849 {noformat} 2024-07-01T06:17:39.8130959Z [WARNING] Rule 0: org.apache.maven.plugins.enforcer.DependencyConvergence failed with message: 2024-07-01T06:17:39.8132291Z Failed while enforcing releasability the error(s) are [ 2024-07-01T06:17:39.8133867Z Dependency convergence error for com.google.protobuf:protobuf-java-util:3.25.1 paths to dependency are: 2024-07-01T06:17:39.8135252Z +-org.apache.tika:tika-grpc:3.0.0-SNAPSHOT 2024-07-01T06:17:39.8136080Z +-io.grpc:grpc-services:1.65.0 2024-07-01T06:17:39.8136947Z +-com.google.protobuf:protobuf-java-util:3.25.1 2024-07-01T06:17:39.8137737Z and 2024-07-01T06:17:39.8138366Z +-org.apache.tika:tika-grpc:3.0.0-SNAPSHOT 2024-07-01T06:17:39.8139307Z +-com.google.protobuf:protobuf-java-util:3.24.0 {noformat} was (Author: tilman): Is this {code:xml} 3.24.0 3.24.0 {code} intended? This is an older version of protobuf than the one we're using. It's also preventing this PR from working: https://github.com/apache/tika/pull/1849 > Tika Grpc Server using Tika Pipes > - > > Key: TIKA-4181 > URL: https://issues.apache.org/jira/browse/TIKA-4181 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > Attachments: image-2024-02-06-07-54-50-116.png > > > Create a Tika Grpc server. > You should be able to create Tika Pipes fetchers, then use those fetchers. > You can then use those fetchers to FetchAndParse in 3 ways: > * synchronous fashion - you send a single request to fetch a file, and get a > single FetchAndParse response tuple. 
> * streaming output - you send a single request and stream back the > FetchAndParse response tuple. > * bi-directional streaming - You stream in 1 or more Fetch requests and > stream back FetchAndParse response tuples. > This requires that we create a service contract that specifies the inputs we require > from each method. > Then we will need to implement the different components with a grpc client > generated using the contract. > This would enable developers to run tika-pipes as a persistently running > daemon instead of just a single batch app, because it can continue to stream > out more inputs. > !image-2024-02-06-07-54-50-116.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4181) Tika Grpc Server using Tika Pipes
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17861075#comment-17861075 ] Tilman Hausherr commented on TIKA-4181: --- Is this {code:xml} 3.24.0 3.24.0 {code} intended? This is an older version of protobuf than the one we're using. It's also preventing this PR from working: https://github.com/apache/tika/pull/1849 > Tika Grpc Server using Tika Pipes > - > > Key: TIKA-4181 > URL: https://issues.apache.org/jira/browse/TIKA-4181 > Project: Tika > Issue Type: New Feature > Components: tika-pipes >Reporter: Nicholas DiPiazza >Priority: Major > Attachments: image-2024-02-06-07-54-50-116.png > > > Create a Tika Grpc server. > You should be able to create Tika Pipes fetchers, then use those fetchers. > You can then use those fetchers to FetchAndParse in 3 ways: > * synchronous fashion - you send a single request to fetch a file, and get a > single FetchAndParse response tuple. > * streaming output - you send a single request and stream back the > FetchAndParse response tuple. > * bi-directional streaming - You stream in 1 or more Fetch requests and > stream back FetchAndParse response tuples. > This requires that we create a service contract that specifies the inputs we require > from each method. > Then we will need to implement the different components with a grpc client > generated using the contract. > This would enable developers to run tika-pipes as a persistently running > daemon instead of just a single batch app, because it can continue to stream > out more inputs. > !image-2024-02-06-07-54-50-116.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859718#comment-17859718 ] Tilman Hausherr commented on TIKA-4251: --- I'm wondering if this means lots of changes to check at the beginning. This is the kind of plugin that would be ideal for a supply chain attack. > [DISCUSS] move to cosium's git-code-format-maven-plugin with > google-java-format > --- > > Key: TIKA-4251 > URL: https://issues.apache.org/jira/browse/TIKA-4251 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I was recently working a bit on incubator-stormcrawler, and I noticed that > they are using cosium's git-code-format-maven-plugin: > https://github.com/Cosium/git-code-format-maven-plugin > I was initially annoyed that I couldn't quickly figure out what I had to fix > to make the linter happy, but then I realized there was a magic command: > {{mvn git-code-format:format-code}} which just fixed the code so that the > linter passed. > The one drawback I found is that it does not fix nor does it alert on > wildcard imports. We could still use checkstyle for that but only have one > rule for checkstyle. > The other drawback is that there is not a lot of room for variation from > google's style. This may actually be a benefit, too, of course. > I just ran this on {{tika-core}} here: > https://github.com/apache/tika/tree/google-java-format > What would you think about making this change for 3.x? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4270) wrong skew angle in tika-parser-ocr-module
[ https://issues.apache.org/jira/browse/TIKA-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4270: -- Description: We use tika to extract text from different sources, including images with text that is rotated at a certain angle. To extract text from image with ocr, tika first deskew image. The skew angle is not calculated correctly. In example [^for_issue] (PNG file), the text is rotated at an angle of ~40 degrees. But the skew angle function (org.apache.tika.parser.ocr.tess4j.ImageDeskew#getSkewAngle) returns an angle of about 15. The slope angle calculation flag is enabled. The documentation (https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#:~:text=To%20identify%20rotation) does not have sufficient information for this version of tika, there is a todo box and some relevant information for tika 1 (requires python and its libraries, but in the version of tika we use, angle calculations are implemented only using java) was: We use tika to extract text from different sources, including images with text that is rotated at a certain angle. To extract text from image with ocr, tika first deskew image. The skew angle is not calculated correctly. In example [^for_issue] , the text is rotated at an angle of ~40 degrees. But the skew angle function (org.apache.tika.parser.ocr.tess4j.ImageDeskew#getSkewAngle) returns an angle of about 15. The slope angle calculation flag is enabled. 
The documentation (https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#:~:text=To%20identify%20rotation) does not have sufficient information for this version of tika, there is a todo box and some relevant information for tika 1 (requires python and its libraries, but in the version of tika we use, angle calculations are implemented only using java) > wrong skew angle in tika-parser-ocr-module > -- > > Key: TIKA-4270 > URL: https://issues.apache.org/jira/browse/TIKA-4270 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.1 >Reporter: Roman >Priority: Major > Attachments: for_issue > > > We use tika to extract text from different sources, including images with > text that is rotated at a certain angle. To extract text from image with ocr, > tika first deskew image. The skew angle is not calculated correctly. In > example [^for_issue] (PNG file), the text is rotated at an angle of ~40 > degrees. But the skew angle function > (org.apache.tika.parser.ocr.tess4j.ImageDeskew#getSkewAngle) returns an angle > of about 15. The slope angle calculation flag is enabled. > The documentation > (https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#:~:text=To%20identify%20rotation) > does not have sufficient information for this version of tika, there is a > todo box and some relevant information for tika 1 (requires python and its > libraries, but in the version of tika we use, angle calculations are > implemented only using java) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-4267. - Resolution: Invalid Closing for now, please comment and/or reopen if needed. > Not getting correct mime type for a few file extensions. example: csv > - > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Affects Versions: 1.28.4 >Reporter: niv >Priority: Major > > Mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to the list of extensions currently supported by the latest > jar, with their mime types, if any. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4267: -- Summary: Not getting correct mime type for a few file extensions. example: csv (was: Not getting correct mimet type for few file extensions. example :csv) > Not getting correct mime type for a few file extensions. example: csv > - > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Affects Versions: 1.28.4 >Reporter: niv >Priority: Major > > Mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to the list of extensions currently supported by the latest > jar, with their mime types, if any. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4267: -- Affects Version/s: 1.28.4 > Not getting correct mimet type for few file extensions. example :csv > > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Affects Versions: 1.28.4 >Reporter: niv >Priority: Major > > Mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to the list of extensions currently supported by the latest > jar, with their mime types, if any. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598 ] Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:06 PM: The current version is 2.9.2, please retry with that one. Get the list of parsers with this code: {code:java} AutoDetectParser parser = new AutoDetectParser(); ParseContext context = new ParseContext(); Map<MediaType, Parser> parsers = parser.getParsers(context); Tika tika = new Tika(); System.out.println(tika.toString()); System.out.println("List of parsers: "); int idx = 0; for (Map.Entry<MediaType, Parser> p : parsers.entrySet()) { MediaType t = p.getKey(); System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype()); ++idx; } {code} was (Author: tilman): The current version is 2.9.2, please retry with that one. > Not getting correct mimet type for few file extensions. example :csv > > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Reporter: niv >Priority: Major > > Mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to the list of extensions currently supported by the latest > jar, with their mime types, if any. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598 ] Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:07 PM: The current version is 2.9.2, please retry with that one; if it still doesn't work, please attach your csv file. Get the list of parsers with this code: {code:java} AutoDetectParser parser = new AutoDetectParser(); ParseContext context = new ParseContext(); Map<MediaType, Parser> parsers = parser.getParsers(context); Tika tika = new Tika(); System.out.println(tika.toString()); System.out.println("List of parsers: "); int idx = 0; for (Map.Entry<MediaType, Parser> p : parsers.entrySet()) { MediaType t = p.getKey(); System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype()); ++idx; } {code} was (Author: tilman): The current version is 2.9.2, please retry with that one. Get the list of parsers with this code: {code:java} AutoDetectParser parser = new AutoDetectParser(); ParseContext context = new ParseContext(); Map<MediaType, Parser> parsers = parser.getParsers(context); Tika tika = new Tika(); System.out.println(tika.toString()); System.out.println("List of parsers: "); int idx = 0; for (Map.Entry<MediaType, Parser> p : parsers.entrySet()) { MediaType t = p.getKey(); System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype()); ++idx; } {code} > Not getting correct mime type for a few file extensions. example: csv > - > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Affects Versions: 1.28.4 >Reporter: niv >Priority: Major > > Mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application?
> Please point me to the list of extensions currently supported by the latest > jar, with their mime types, if any. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598 ] Tilman Hausherr commented on TIKA-4267: --- The current version is 2.9.2, please retry with that one. > Not getting correct mimet type for few file extensions. example :csv > > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Reporter: niv >Priority: Major > > Mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to the list of extensions currently supported by the latest > jar, with their mime types, if any. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-1907) Big Pdf parsing to text - Out of memory
[ https://issues.apache.org/jira/browse/TIKA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1907: -- Fix Version/s: 3.0.0 > Big Pdf parsing to text - Out of memory > --- > > Key: TIKA-1907 > URL: https://issues.apache.org/jira/browse/TIKA-1907 > Project: Tika > Issue Type: Bug >Affects Versions: 1.12 >Reporter: Nicolas Daniels >Priority: Major > Fix For: 3.0.0 > > > Linked to PDFBox issue: [https://issues.apache.org/jira/browse/PDFBOX-3284] > I'm duplicating it here to make sure it will be fixed in Tika as well. Maybe > PDFBox is not the appropriate lib to use in such a case. > Trying to read the same PDF using Tika leads to the same problem: > {code:title=Test.java|borderStyle=solid} > @Test > public void testParsePdf_Content_Memory() throws Exception { > InputStream inputStream = new > FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf"); > try { > StringWriter writer = new StringWriter(); > FileWriter fileWriter = new FileWriter(new > File("c:/tmp/test.txt")); > BodyContentHandler handler = new BodyContentHandler(fileWriter); > Metadata metadata = new Metadata(); > new PDFParser().parse(inputStream, handler, metadata, new > ParseContext()); > fileWriter.close(); > } finally { > inputStream.close(); > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.
[ https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845590#comment-17845590 ] Tilman Hausherr edited comment on TIKA-4254 at 5/12/24 9:40 AM: THausherr commented on PR #1754: URL: https://github.com/apache/tika/pull/1754#issuecomment-2105679546 Maybe I get it: {{repo = config.getMimeRepository();}} isn't creating anything new, it's retrieving something that is changed later by the test? If my understanding is correct then it's a deeper problem. was (Author: githubbot): THausherr commented on PR #1754: URL: https://github.com/apache/tika/pull/1754#issuecomment-2105679546 Maybe I get it: `repo = config.getMimeRepository();` isn't creating anything new, it's retrieving something that is changed later by the test? If my understanding is correct then it's a deeper problem. > The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the > first run and fails in repeated runs in the same environment. > > > Key: TIKA-4254 > URL: https://issues.apache.org/jira/browse/TIKA-4254 > Project: Tika > Issue Type: Bug >Reporter: Kaiyao Ke >Priority: Major > > ### Brief Description of the Bug > The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the > first run but fails in the second run in the same environment. The source of > the problem is that each test execution initializes a new media type > (`MimeType`) instance `testType` (same problem for `testType2`), and all > media types across different test executions attempt to use the same name > pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of > the test, the line `this.repo.addPattern(testType, pattern, true);` will > throw an error, since the name pattern is already used by the `testType` > instance initiated from the first test execution. 
Specifically, in the second > run, the `addGlob()` method of the `Patterns` class will detect the conflicting > pattern and throw a `MimeTypeException` (line 123 in `Patterns.java`). > ### Failure Message in the 2nd Test Run: > ``` > org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: > rtg_sst_grb_0\.5\.\d{8} > at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123) > at org.apache.tika.mime.Patterns.add(Patterns.java:71) > at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450) > at > org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851) > at java.base/java.lang.reflect.Method.invoke(Method.java:568) > at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) > at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) > ``` > ### Reproduce > Use the `NIOInspector` plugin that supports rerunning individual tests in the > same environment: > ``` > cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package > mvn edu.illinois:NIOInspector:rerun > -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex > ``` > ### Proposed Fix > Declare `testType` and `testType2` as static variables and initialize them at > class loading time. Therefore, repeated runs of `testJavaRegex()` will not > conflict with each other. All tests pass and are idempotent after the fix. > ### Necessity of Fix > A fix is recommended as unit tests shall be idempotent, and state pollution > shall be mitigated so that newly introduced tests do not fail in the future > due to polluted shared states. -- This message was sent by Atlassian Jira (v8.20.10#820010)
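The failure mode discussed in TIKA-4254 (a registry that outlives a single test run, plus a type object that is recreated on every run) can be sketched without Tika at all. The following is only a simplified model: `FakeType` and `FakeRegistry` are hypothetical stand-ins, not Tika classes, and the rule that re-adding a pattern for the very same type object is tolerated is an assumption implied by the proposed fix, not something confirmed from the Tika sources.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedRegistryDemo {
    // Hypothetical stand-in for a media type; object identity is what matters here.
    public static final class FakeType {
        final String name;
        FakeType(String name) { this.name = name; }
    }

    // Hypothetical stand-in for a pattern registry that is shared across test runs.
    public static final class FakeRegistry {
        private final Map<String, FakeType> patterns = new HashMap<>();

        // Assumed rule: a pattern may be re-added for the same type object,
        // but not for a different one.
        void addPattern(FakeType type, String pattern) {
            FakeType previous = patterns.putIfAbsent(pattern, type);
            if (previous != null && previous != type) {
                throw new IllegalStateException("Conflicting pattern: " + pattern);
            }
        }
    }

    // Simulates two runs of the same test against one shared registry.
    // reuseType == false: each run creates a fresh type (the reported behaviour).
    // reuseType == true: both runs use one type created up front (the proposed fix).
    public static boolean secondRunFails(boolean reuseType) {
        FakeRegistry shared = new FakeRegistry();
        FakeType reused = new FakeType("test/reused");
        String pattern = "rtg_sst_grb_0\\.5\\.\\d{8}";
        try {
            for (int run = 0; run < 2; run++) {
                FakeType type = reuseType ? reused : new FakeType("test/fresh");
                shared.addPattern(type, pattern);
            }
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("fresh type per run, second run fails: " + secondRunFails(false));
        System.out.println("reused type, second run fails: " + secondRunFails(true));
    }
}
```

In this model the conflict only appears because the registry is retrieved rather than created per test, which matches the concern raised in the comment about `config.getMimeRepository()`.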
[jira] [Commented] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.
[ https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845566#comment-17845566 ] Tilman Hausherr commented on TIKA-4254: --- Why would we ever run the test twice in the same environment? > The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the > first run and fails in repeated runs in the same environment. > > > Key: TIKA-4254 > URL: https://issues.apache.org/jira/browse/TIKA-4254 > Project: Tika > Issue Type: Bug >Reporter: Kaiyao Ke >Priority: Major > > ### Brief Description of the Bug > The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the > first run but fails in the second run in the same environment. The source of > the problem is that each test execution initializes a new media type > (`MimeType`) instance `testType` (same problem for `testType2`), and all > media types across different test executions attempt to use the same name > pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of > the test, the line `this.repo.addPattern(testType, pattern, true);` will > throw an error, since the name pattern is already used by the `testType` > instance initiated from the first test execution. Specifically, in the second > run, the `addGlob()` method of the `Pattern` class will assert conflict > patterns and throw a`MimeTypeException`(line 123 in `Patterns.java`). 
> ### Failure Message in the 2nd Test Run: > ``` > org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: > rtg_sst_grb_0\.5\.\d{8} > at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123) > at org.apache.tika.mime.Patterns.add(Patterns.java:71) > at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450) > at > org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851) > at java.base/java.lang.reflect.Method.invoke(Method.java:568) > at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) > at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) > ``` > ### Reproduce > Use the `NIOInspector` plugin that supports rerunning individual tests in the > same environment: > ``` > cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package > mvn edu.illinois:NIOInspector:rerun > -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex > ``` > ### Proposed Fix > Declare `testType` and `testType2` as static variables and initialize them at > class loading time. Therefore, repeated runs of `testJavaRegex()` will not > conflict with each other. All tests pass and are idempotent after the fix. > ### Necessity of Fix > A fix is recommended as unit tests shall be idempotent, and state pollution > shall be mitigated so that newly introduced tests do not fail in the future > due to polluted shared states. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: Bump dependabot to weekly?
Yes! Tilman On 29.04.2024 16:55, Tim Allison wrote: Oh, interesting. Should we bump this value to, say, 20? https://docs.github.com/en/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file#open-pull-requests-limit ? Thank you, Tilman! On Mon, Apr 29, 2024 at 10:47 AM Tilman Hausherr wrote: The positive side is that it's less interruptions. One negative side is that there seems to be a maximum. Today it didn't report the AWS update, which was detected in the past. Tilman
Re: Bump dependabot to weekly?
The positive side is that it's less interruptions. One negative side is that there seems to be a maximum. Today it didn't report the AWS update, which was detected in the past. Tilman On 29.04.2024 16:34, Tim Allison wrote: The move to weekly dependabot has been a bit of a relief for me personally. Our mail list isn't clogged w daily dependabot updates (and yes, I know I can apply a filter :/). How is it working for everyone else? On Wed, Apr 10, 2024 at 4:09 PM Tim Allison wrote: you start deleting them reflexively out of your email! Not Tilman!!! Let's move to weekly and see how that works? On Wed, Apr 10, 2024 at 3:57 PM Eric Pugh wrote: Hence why I like the monthly unless it’s a special case…. The flood of updates just means you start deleting them reflexively out of your email! Now, if you have a dependency and you’re maybe actively working on it, and it’s changing quickly, then that might be an argument for daily. On Apr 10, 2024, at 12:53 PM, Tilman Hausherr wrote: I'm fine with daily because this way we can learn ASAP if there are troubles with new dependency versions, although I'm now too busy. Tilman -- Original-Nachricht -- Von: Tim Allison Betreff: Bump dependabot to weekly? Datum: 10.04.2024, 18:08 Uhr An: All, Tilman has been doing heroic work keeping us up to date with dependabot's PRs. Given our pace of releases, would it make sense to backoff to weekly updates? Before running regression tests, we'd run the update plugin to make sure that we're up to date. What do you think? 
Best, Tim
[jira] [Commented] (TIKA-4245) Tika does not get html content properly
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840922#comment-17840922 ] Tilman Hausherr commented on TIKA-4245: --- The file claims to be utf-16 but it isn't. If I change it to utf-8 in the editor then I get an NPE in the GUI. > Tika does not get html content properly > > > Key: TIKA-4245 > URL: https://issues.apache.org/jira/browse/TIKA-4245 > Project: Tika > Issue Type: Bug >Reporter: Xiaohong Yang >Priority: Major > Attachments: Sample html file and tika config xml.zip > > > We use org.apache.tika.parser.AutoDetectParser to get the content of html > files. And we found out that it does not get the content fo the sample file > properly. > Following is the sample code and attached is the tika-config.xml and the > sample html file. The content extracted with Tika reads > "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢潮瑥湴㴢瑥硴…". That is different > from the native file. > > > The operating system is Ubuntu 20.04. Java version is 21. Tika version is > 2.9.2. 
> {code:java} > import org.apache.commons.io.FileUtils; > import org.apache.tika.config.TikaConfig; > import org.apache.tika.metadata.Metadata; > import org.apache.tika.parser.AutoDetectParser; > import org.apache.tika.parser.ParseContext; > import org.apache.tika.parser.Parser; > import org.apache.tika.sax.BodyContentHandler; > > import java.io.File; > import java.io.FileInputStream; > import java.io.PrintWriter; > import java.nio.file.Files; > import java.nio.file.Path; > import java.nio.file.Paths; > > public class ExtractTxtFromHtml { > private static final Path inputFile = new > File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath(); > > public static void main(String args[]) { > extactText(false); > extactText(true); > } > > static void extactText(boolean largeFile) { > PrintWriter outputFileWriter = null; > try { > BodyContentHandler handler; > Path outputFilePath = null; > > if (largeFile) { > // write tika output to disk > outputFilePath = > Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt"); > outputFileWriter = new > PrintWriter(Files.newOutputStream(outputFilePath)); > handler = new BodyContentHandler(outputFileWriter); > } else { > // stream it in memory > handler = new BodyContentHandler(-1); > } > > Metadata metadata = new Metadata(); > FileInputStream inputData = new > FileInputStream(inputFile.toFile()); > TikaConfig config = new > TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml"); > Parser autoDetectParser = new AutoDetectParser(config); > ParseContext context = new ParseContext(); > context.set(TikaConfig.class, config); > autoDetectParser.parse(inputData, handler, metadata, context); > > String content; > if (largeFile) { > content = FileUtils.readFileToString(outputFilePath.toFile()); > } > else { > content = handler.toString(); > } > System.out.println("content = " + content); > } > catch(Exception ex) { > ex.printStackTrace(); > } finally { > if (outputFileWriter != null) { > 
outputFileWriter.close(); > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
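The garbled output quoted in TIKA-4245 is consistent with the comment that the file claims to be UTF-16 but is not: when single-byte text is decoded as UTF-16, every pair of bytes is fused into one 16-bit character. The sketch below uses only the JDK and assumes big-endian UTF-16 and an ASCII-compatible source; the class and method names are illustrative and not part of Tika.

```java
import java.nio.charset.StandardCharsets;

public class Utf16MisreadDemo {
    // Decodes bytes that are really single-byte text as if they were UTF-16BE,
    // so each pair of bytes becomes one 16-bit code unit.
    public static String misreadAsUtf16(String asciiText) {
        byte[] raw = asciiText.getBytes(StandardCharsets.US_ASCII);
        return new String(raw, StandardCharsets.UTF_16BE);
    }

    // Reverses the mistake: turn the characters back into their original bytes
    // and decode them with the charset the text was actually written in.
    public static String recover(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.UTF_16BE);
        return new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String garbled = misreadAsUtf16("<html ");
        // The bytes 3C 68 74 6D 6C 20 pair up as U+3C68 U+746D U+6C20.
        System.out.println(garbled);
        System.out.println(recover(garbled));
    }
}
```

The three characters produced for `<html ` are the same three that open the extracted content in the report, which suggests the parser trusted the declared charset rather than the actual bytes.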
[jira] [Commented] (TIKA-4245) Tika does not get html content properly
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840908#comment-17840908 ] Tilman Hausherr commented on TIKA-4245: --- Happens also with the tika app GUI. > Tika does not get html content properly > > > Key: TIKA-4245 > URL: https://issues.apache.org/jira/browse/TIKA-4245 > Project: Tika > Issue Type: Bug >Reporter: Xiaohong Yang >Priority: Major > Attachments: Sample html file and tika config xml.zip > > > We use org.apache.tika.parser.AutoDetectParser to get the content of html > files. And we found out that it does not get the content fo the sample file > properly. > Following is the sample code and attached is the tika-config.xml and the > sample html file. The content extracted with Tika reads > "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢潮瑥湴㴢瑥硴…". That is different > from the native file. > > > The operating system is Ubuntu 20.04. Java version is 21. Tika version is > 2.9.2. > {code:java} > import org.apache.commons.io.FileUtils; > import org.apache.tika.config.TikaConfig; > import org.apache.tika.metadata.Metadata; > import org.apache.tika.parser.AutoDetectParser; > import org.apache.tika.parser.ParseContext; > import org.apache.tika.parser.Parser; > import org.apache.tika.sax.BodyContentHandler; > > import java.io.File; > import java.io.FileInputStream; > import java.io.PrintWriter; > import java.nio.file.Files; > import java.nio.file.Path; > import java.nio.file.Paths; > > public class ExtractTxtFromHtml { > private static final Path inputFile = new > File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath(); > > public static void main(String args[]) { > extactText(false); > extactText(true); > } > > static void extactText(boolean largeFile) { > PrintWriter outputFileWriter = null; > try { > BodyContentHandler handler; > Path outputFilePath = null; > > if (largeFile) { > // write tika output to disk > outputFilePath = > 
Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt"); > outputFileWriter = new > PrintWriter(Files.newOutputStream(outputFilePath)); > handler = new BodyContentHandler(outputFileWriter); > } else { > // stream it in memory > handler = new BodyContentHandler(-1); > } > > Metadata metadata = new Metadata(); > FileInputStream inputData = new > FileInputStream(inputFile.toFile()); > TikaConfig config = new > TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml"); > Parser autoDetectParser = new AutoDetectParser(config); > ParseContext context = new ParseContext(); > context.set(TikaConfig.class, config); > autoDetectParser.parse(inputData, handler, metadata, context); > > String content; > if (largeFile) { > content = FileUtils.readFileToString(outputFilePath.toFile()); > } > else { > content = handler.toString(); > } > System.out.println("content = " + content); > } > catch(Exception ex) { > ex.printStackTrace(); > } finally { > if (outputFileWriter != null) { > outputFileWriter.close(); > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4245) Tika does not get html content properly
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4245: -- Description:
We use org.apache.tika.parser.AutoDetectParser to get the content of HTML files, and we found that it does not get the content of the sample file properly. The sample code follows; the tika-config.xml and the sample HTML file are attached. The content extracted with Tika reads "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢潮瑥湴㴢瑥硴…". That is different from the native file. The operating system is Ubuntu 20.04. Java version is 21. Tika version is 2.9.2.
{code:java}
import org.apache.commons.io.FileUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractTxtFromHtml {
    private static final Path inputFile = new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();

    public static void main(String args[]) {
        extactText(false);
        extactText(true);
    }

    static void extactText(boolean largeFile) {
        PrintWriter outputFileWriter = null;
        try {
            BodyContentHandler handler;
            Path outputFilePath = null;
            if (largeFile) {
                // write tika output to disk
                outputFilePath = Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
                outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
                handler = new BodyContentHandler(outputFileWriter);
            } else {
                // stream it in memory
                handler = new BodyContentHandler(-1);
            }
            Metadata metadata = new Metadata();
            FileInputStream inputData = new FileInputStream(inputFile.toFile());
            TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
            Parser autoDetectParser = new AutoDetectParser(config);
            ParseContext context = new ParseContext();
            context.set(TikaConfig.class, config);
            autoDetectParser.parse(inputData, handler, metadata, context);
            String content;
            if (largeFile) {
                content = FileUtils.readFileToString(outputFilePath.toFile());
            } else {
                content = handler.toString();
            }
            System.out.println("content = " + content);
        } catch(Exception ex) {
            ex.printStackTrace();
        } finally {
            if (outputFileWriter != null) {
                outputFileWriter.close();
            }
        }
    }
}
{code}
was: We use org.apache.tika.parser.AutoDetectParser to get the content of HTML files, and we found that it does not get the content of the sample file properly. The sample code follows; the tika-config.xml and the sample HTML file are attached. The content extracted with Tika reads "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢潮瑥湴㴢瑥硴…". That is different from the native file. The operating system is Ubuntu 20.04. Java version is 21. Tika version is 2.9.2.
import org.apache.commons.io.FileUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractTxtFromHtml {
    private static final Path inputFile = new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();

    public static void main(String args[]) {
        extactText(false);
        extactText(true);
    }

    static void extactText(boolean largeFile) {
        PrintWriter outputFileWriter = null;
        try {
            BodyContentHandler handler;
            Path outputFilePath = null;
            if (largeFile) {
                // write tika output to disk
                outputFilePath = Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
                outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
                handler = new BodyContentHandler(outputFileWriter);
            } else {
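The garbled text quoted in the TIKA-4245 description is consistent with single-byte HTML markup being read two bytes at a time as UTF-16BE. The following is a minimal sketch of that observation using only the JDK; it is not taken from the issue and says nothing about where in Tika the wrong charset is chosen. The class and method names (MojibakeDemo, misdecodeAsUtf16Be, recover) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Takes the UTF-8 bytes of the text and decodes them, wrongly, as UTF-16BE. */
    static String misdecodeAsUtf16Be(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    /** Reverses the mistake: re-encodes as UTF-16BE and decodes the bytes as UTF-8. */
    static String recover(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.UTF_16BE);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First eight characters of the reported output, written as escapes
        // so the source file encoding does not matter.
        String reported = "\u3c68\u746d\u6c20\u786d\u6c6e\u733a\u666f\u3d22";
        System.out.println(recover(reported)); // prints: <html xmlns:fo="

        // Going the other way reproduces the reported characters.
        System.out.println(misdecodeAsUtf16Be("<html xmlns:fo=\"").equals(reported)); // prints: true
    }
}
```

Each reported character packs two ASCII bytes into one 16-bit code unit (for example 0x3C 0x68, "<h", becomes U+3C68), which is why the output is half as long as the markup it came from.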
[jira] [Comment Edited] (TIKA-4166) dependency updates for Tika 3.0
[ https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839745#comment-17839745 ] Tilman Hausherr edited comment on TIKA-4166 at 4/22/24 3:27 PM: It turned out to be something different than the missing package. After googling for the error message I found an SO answer that I had upvoted in the past https://stackoverflow.com/a/54467008/535646 was (Author: tilman): It turned out to be something different than the missing package. After googling for the error message I found an SO that I had upvoted in the past https://stackoverflow.com/a/54467008/535646 > dependency updates for Tika 3.0 > --- > > Key: TIKA-4166 > URL: https://issues.apache.org/jira/browse/TIKA-4166 > Project: Tika > Issue Type: Task > Components: build > Reporter: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0-BETA > > > Separate ticket for updates for 3.0, especially those not found by dependabot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0
[ https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839745#comment-17839745 ] Tilman Hausherr commented on TIKA-4166: --- It turned out to be something different than the missing package. After googling for the error message I found an SO that I had upvoted in the past https://stackoverflow.com/a/54467008/535646 > dependency updates for Tika 3.0 > --- > > Key: TIKA-4166 > URL: https://issues.apache.org/jira/browse/TIKA-4166 > Project: Tika > Issue Type: Task > Components: build > Reporter: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0-BETA > > > Separate ticket for updates for 3.0, especially those not found by dependabot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: How to proceed when you are getting OSS index errors?
Hi, We look at what the CVE is about. Some CVEs are irrelevant (see the recent rant from Tim), and for those we can add an exclusion in the OSS section. Sometimes all that is needed is to update a dependency, add it to the management section, or exclude it (on the assumption that the tests cover everything). About this case: the repository has been updated to exclude two threeten versions from OSS. Tilman On 22.04.2024 16:16, Nicholas DiPiazza wrote: When getting these sorts of errors: [ERROR] Failed to execute goal org.sonatype.ossindex.maven:ossindex-maven-plugin:3.2.0:audit (audit-dependencies) on project tika-dl: Detected 1 vulnerable components: [ERROR] org.threeten:threetenbp:jar:1.3.3:provided; https://ossindex.sonatype.org/component/pkg:maven/org.threeten/threetenbp@1.3.3?utm_source=ossindex-client_medium=integration_content=1.8.1 [ERROR] * [CVE-2024-23081] CWE-476: NULL Pointer Dereference (3.7); https://ossindex.sonatype.org/vulnerability/CVE-2024-23081?component-type=maven=org.threeten%2Fthreetenbp_source=ossindex-client_medium=integration_content=1.8.1 [ERROR] * [CVE-2024-23082] CWE-190: Integer Overflow or Wraparound (5.3); https://ossindex.sonatype.org/vulnerability/CVE-2024-23082?component-type=maven=org.threeten%2Fthreetenbp_source=ossindex-client_medium=integration_content=1.8.1 [ERROR] How do you all typically proceed? Do I patch the issue and move on somehow? How do I get my builds to work now that this error has happened? -Nicholas
[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0
[ https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839652#comment-17839652 ] Tilman Hausherr commented on TIKA-4166: --- The latest Apache parent update brings a javadoc plugin update, and it results in a failure on the CI:
{noformat}
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:3.6.3:aggregate (default-cli) on project tika: An error has occurred in Javadoc report generation:
[ERROR] Exit code: 2
[ERROR] javadoc: error - No source files for package org.apache.tika.extractor
[ERROR] Command line was: /usr/local/asfpackages/java/adoptium-jdk-11.0.16.1+1/bin/javadoc @options @packages
{noformat}
A possible cause for this could be that in tika-batch there is a test package that doesn't exist as a source package. It didn't happen locally for me because I didn't use "javadoc:aggregate". I'll do some more tests to see whether renaming the test package fixes this. > dependency updates for Tika 3.0 > --- > > Key: TIKA-4166 > URL: https://issues.apache.org/jira/browse/TIKA-4166 > Project: Tika > Issue Type: Task > Components: build >Reporter: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0-BETA > > > Separate ticket for updates for 3.0, especially those not found by dependabot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4240) Change dependabot to weekly
[ https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836236#comment-17836236 ] Tilman Hausherr commented on TIKA-4240: --- I prefer daily but if more people feel pressured or annoyed by these mails (I never felt that way) then I accept weekly. > Change dependabot to weekly > --- > > Key: TIKA-4240 > URL: https://issues.apache.org/jira/browse/TIKA-4240 > Project: Tika > Issue Type: Task > Components: build >Reporter: Tim Allison >Priority: Trivial > > On the list, I proposed this change. Some were in favor of dropping it back > to monthly. [~tilman] made the argument for the benefit of seeing problems > quickly and also acknowledged that it is a burden to merge the daily PRs. > I propose bumping dependabot back to weekly for a bit, and we'll see how it > works as a middle ground. > If anyone feels strongly about moving back to daily, we can do that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4240) Change dependabot to weekly
[ https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4240: -- Component/s: build > Change dependabot to weekly > --- > > Key: TIKA-4240 > URL: https://issues.apache.org/jira/browse/TIKA-4240 > Project: Tika > Issue Type: Task > Components: build >Reporter: Tim Allison >Priority: Trivial > > On the list, I proposed this change. Some were in favor of dropping it back > to monthly. [~tilman] made the argument for the benefit of seeing problems > quickly and also acknowledged that it is a burden to merge the daily PRs. > I propose bumping dependabot back to weekly for a bit, and we'll see how it > works as a middle ground. > If anyone feels strongly about moving back to daily, we can do that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4240) Change dependabot to weekly
[ https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836224#comment-17836224 ] Tilman Hausherr commented on TIKA-4240: --- Not a burden (that was Eric, sort-of), I just don't have the time right now to fix the current build failure. I like the alerts, it's a low hanging fruit and also helps me to learn more about the code. > Change dependabot to weekly > --- > > Key: TIKA-4240 > URL: https://issues.apache.org/jira/browse/TIKA-4240 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > > On the list, I proposed this change. Some were in favor of dropping it back > to monthly. [~tilman] made the argument for the benefit of seeing problems > quickly and also acknowledged that it is a burden to merge the daily PRs. > I propose bumping dependabot back to weekly for a bit, and we'll see how it > works as a middle ground. > If anyone feels strongly about moving back to daily, we can do that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: Bump dependabot to weekly?
I'm fine with daily because this way we can learn ASAP if there is trouble with new dependency versions, although I'm too busy right now. Tilman -- Original Message -- From: Tim Allison Subject: Bump dependabot to weekly? Date: 10.04.2024, 18:08 To: All, Tilman has been doing heroic work keeping us up to date with dependabot's PRs. Given our pace of releases, would it make sense to back off to weekly updates? Before running regression tests, we'd run the update plugin to make sure that we're up to date. What do you think? Best, Tim
[jira] [Commented] (TIKA-4238) replace some deprecated code
[ https://issues.apache.org/jira/browse/TIKA-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834529#comment-17834529 ] Tilman Hausherr commented on TIKA-4238: --- This was a low-hanging fruit. I could also have done UnsynchronizedByteArrayInputStream, but replacing that one would not only would make the code much bigger, it would also require to catch an exception that isn't thrown now, so lets just wait what they do. https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/UnsynchronizedByteArrayInputStream.Builder.html#get() > replace some deprecated code > > > Key: TIKA-4238 > URL: https://issues.apache.org/jira/browse/TIKA-4238 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr > Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4238) replace some deprecated code
[ https://issues.apache.org/jira/browse/TIKA-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834529#comment-17834529 ] Tilman Hausherr edited comment on TIKA-4238 at 4/6/24 2:12 PM: --- This was a low-hanging fruit. I could also have done UnsynchronizedByteArrayInputStream, but replacing that one would not only make the code much bigger, it would also require to catch an exception that isn't thrown now, so lets just wait what they do in the future. https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/UnsynchronizedByteArrayInputStream.Builder.html#get() was (Author: tilman): This was a low-hanging fruit. I could also have done UnsynchronizedByteArrayInputStream, but replacing that one would not only would make the code much bigger, it would also require to catch an exception that isn't thrown now, so lets just wait what they do. https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/UnsynchronizedByteArrayInputStream.Builder.html#get() > replace some deprecated code > > > Key: TIKA-4238 > URL: https://issues.apache.org/jira/browse/TIKA-4238 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr > Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4218: -- Affects Version/s: 2.9.1 > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.1 >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.9.2 > > Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4218. --- Assignee: Tim Allison Resolution: Fixed > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key
[ https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr reassigned TIKA-4171: - Assignee: Tim Allison > Tika server only returns last value for PDFs that have multiple of the same > key > --- > > Key: TIKA-4171 > URL: https://issues.apache.org/jira/browse/TIKA-4171 > Project: Tika > Issue Type: Bug > Components: tika-server >Reporter: Cassandra Xia >Assignee: Tim Allison >Priority: Major > Fix For: 3.0.0-BETA, 2.9.2 > > Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert > FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, > testPDF_XFA_govdocs1_258578.pdf.html > > > Thanks for the great work on Tika server, it is the only OSS that can handle > Adobe's protected form format that FERC uses. > One problem that I'm hitting is that the FERC form that I am parsing has > multiple values for the same key name, e.g. in the screenshot below line 1-7 > all have the same key name. When Tika Server parses this PDF, it only returns > the value in row 7 (losing the previous 6 values). > My hunch is that somewhere in Tika Server, the values are getting stored in > some dictionary object, so the final value is the only survivor. Would it be > possible to return the extra values as a list from Tika Server? > Example PDF attached - thank you for taking a look! > !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6! -- This message was sent by Atlassian Jira (v8.20.10#820010)
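The reporter's hunch in TIKA-4171 (values stored under one key in a dictionary, so only the last survives) and the requested behaviour (return all values as a list) can be illustrated with plain JDK collections. This is a sketch under stated assumptions, not Tika server's actual code; the class MultiValueDemo, its methods, and the sample key "row" are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValueDemo {
    /** Last write wins: a plain map keeps a single value per key. */
    static Map<String, String> collectLastOnly(String[][] pairs) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            fields.put(pair[0], pair[1]); // silently overwrites any earlier value
        }
        return fields;
    }

    /** Keeps every value by appending to a per-key list. */
    static Map<String, List<String>> collectAll(String[][] pairs) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            fields.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Three form rows that share one key name, as in the report.
        String[][] pairs = { {"row", "a"}, {"row", "b"}, {"row", "c"} };
        System.out.println(collectLastOnly(pairs)); // prints: {row=c}
        System.out.println(collectAll(pairs));      // prints: {row=[a, b, c]}
    }
}
```

The list-valued map preserves insertion order for both keys and values, so a consumer can still recover the row order of the form.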
[jira] [Updated] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4218: -- Fix Version/s: 2.9.2 > Run regression tests to support 2.9.2 release > - > > Key: TIKA-4218 > URL: https://issues.apache.org/jira/browse/TIKA-4218 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.9.2 > > Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key
[ https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4171. --- Resolution: Fixed > Tika server only returns last value for PDFs that have multiple of the same > key > --- > > Key: TIKA-4171 > URL: https://issues.apache.org/jira/browse/TIKA-4171 > Project: Tika > Issue Type: Bug > Components: tika-server >Reporter: Cassandra Xia >Priority: Major > Fix For: 2.9.2, 3.0.0-BETA > > Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert > FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, > testPDF_XFA_govdocs1_258578.pdf.html > > > Thanks for the great work on Tika server, it is the only OSS that can handle > Adobe's protected form format that FERC uses. > One problem that I'm hitting is that the FERC form that I am parsing has > multiple values for the same key name, e.g. in the screenshot below line 1-7 > all have the same key name. When Tika Server parses this PDF, it only returns > the value in row 7 (losing the previous 6 values). > My hunch is that somewhere in Tika Server, the values are getting stored in > some dictionary object, so the final value is the only survivor. Would it be > possible to return the extra values as a list from Tika Server? > Example PDF attached - thank you for taking a look! > !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4238) replace some deprecated code
[ https://issues.apache.org/jira/browse/TIKA-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4238. --- Resolution: Fixed > replace some deprecated code > > > Key: TIKA-4238 > URL: https://issues.apache.org/jira/browse/TIKA-4238 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr > Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4239) Update to 2.9.3
Tilman Hausherr created TIKA-4239: - Summary: Update to 2.9.3 Key: TIKA-4239 URL: https://issues.apache.org/jira/browse/TIKA-4239 Project: Tika Issue Type: Task Components: build Reporter: Tilman Hausherr -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4239) Update to 2.9.3
[ https://issues.apache.org/jira/browse/TIKA-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4239: -- Affects Version/s: 2.9.2 > Update to 2.9.3 > --- > > Key: TIKA-4239 > URL: https://issues.apache.org/jira/browse/TIKA-4239 > Project: Tika > Issue Type: Task > Components: build >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4162) Update to 2.9.2
[ https://issues.apache.org/jira/browse/TIKA-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4162. --- Assignee: Tilman Hausherr Resolution: Fixed > Update to 2.9.2 > --- > > Key: TIKA-4162 > URL: https://issues.apache.org/jira/browse/TIKA-4162 > Project: Tika > Issue Type: Task > Components: build >Affects Versions: 2.9.1 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 2.9.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4238) replace some deprecated code
Tilman Hausherr created TIKA-4238: - Summary: replace some deprecated code Key: TIKA-4238 URL: https://issues.apache.org/jira/browse/TIKA-4238 Project: Tika Issue Type: Task Affects Versions: 2.9.2 Reporter: Tilman Hausherr Assignee: Tilman Hausherr Fix For: 3.0.0, 2.9.3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
2.9.2 / 2.9.3 admin
I've created the 2.9.3 version in JIRA administration. Could someone (Tim?) please set the 2.9.2 version to released (I didn't want to touch that part)? Tilman
[jira] [Updated] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency
[ https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4236: -- Fix Version/s: 2.9.3 > tika-parser-nlp-module has an unnecessary Guava dependency > -- > > Key: TIKA-4236 > URL: https://issues.apache.org/jira/browse/TIKA-4236 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2 >Reporter: Manfred Baedke >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > This should be avoided, because it's prone to maintenance and security > problems. > It's easy to get rid of it: the class > {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses > {{com.google.common.reflect.TypeToken}}. Since the project uses gson > anyway, it could just be replaced with > {{com.google.gson.reflect.TypeToken}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency
[ https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4236: -- Fix Version/s: (was: 2.9.2) > tika-parser-nlp-module has an unnecessary Guava dependency > -- > > Key: TIKA-4236 > URL: https://issues.apache.org/jira/browse/TIKA-4236 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2 >Reporter: Manfred Baedke >Priority: Major > Fix For: 3.0.0 > > > This should be avoided, because it's prone to maintenance and security > problems. > It's easy to get rid of it: the class > {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses > {{com.google.common.reflect.TypeToken}}. Since the project uses gson > anyway, it could just be replaced with > {{com.google.gson.reflect.TypeToken}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency
[ https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4236. --- Assignee: Tilman Hausherr Resolution: Fixed > tika-parser-nlp-module has an unnecessary Guava dependency > -- > > Key: TIKA-4236 > URL: https://issues.apache.org/jira/browse/TIKA-4236 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2 >Reporter: Manfred Baedke >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > This should be avoided, because it's prone to maintenance and security > problems. > It's easy to get rid of it: the class > {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses > {{com.google.common.reflect.TypeToken}}. Since the project uses gson > anyway, it could just be replaced with > {{com.google.gson.reflect.TypeToken}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)