[jira] [Updated] (TIKA-4283) Add detection for JKS Keystore

2024-07-24 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4283:
--
Component/s: core
 parser

> Add detection for JKS Keystore
> --
>
> Key: TIKA-4283
> URL: https://issues.apache.org/jira/browse/TIKA-4283
> Project: Tika
>  Issue Type: New Feature
>  Components: core, parser
>Affects Versions: 2.9.2
>Reporter: Lonzak
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> I added detection for Java keystores (JKS). It is based on the magic bytes.
>  
> Some additional information:
> [https://en.wikipedia.org/wiki/Java_KeyStore]
> The magic bytes are described here: 
> [https://en.wikipedia.org/wiki/List_of_file_signatures]
>  
> JKS is a proprietary keystore implementation provided by Sun.
> [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation]
>  
> If possible, this should be added to the 2.9.x branch.
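
For illustration, a magic-byte check of the kind described in the issue could look like the Java sketch below. It assumes the JKS signature FE ED FE ED from the list of file signatures linked above. The class and method names are hypothetical, and this is not the change committed to Tika, which is not shown in this thread.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not Tika's actual detector.
class JksMagicSketch {

    // Assumed JKS signature: the file starts with the four bytes FE ED FE ED.
    private static final byte[] JKS_MAGIC = {
        (byte) 0xFE, (byte) 0xED, (byte) 0xFE, (byte) 0xED
    };

    // Returns true if the given leading bytes start with the JKS signature.
    static boolean looksLikeJks(byte[] head) {
        if (head == null || head.length < JKS_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, JKS_MAGIC.length), JKS_MAGIC);
    }

    public static void main(String[] args) {
        byte[] jksHead = {(byte) 0xFE, (byte) 0xED, (byte) 0xFE, (byte) 0xED, 0, 0, 0, 2};
        byte[] zipHead = {'P', 'K', 3, 4};
        System.out.println(looksLikeJks(jksHead)); // true
        System.out.println(looksLikeJks(zipHead)); // false
    }
}
```

A check like this only inspects the first four bytes, so it can say that a file looks like a JKS keystore but not that the keystore is valid.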



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4283) Add detection for JKS Keystore

2024-07-24 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4283:
--
Affects Version/s: 2.9.2

> Add detection for JKS Keystore
> --
>
> Key: TIKA-4283
> URL: https://issues.apache.org/jira/browse/TIKA-4283
> Project: Tika
>  Issue Type: New Feature
>Affects Versions: 2.9.2
>Reporter: Lonzak
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> I added detection for Java keystores (JKS). It is based on the magic bytes.
>  
> Some additional information:
> [https://en.wikipedia.org/wiki/Java_KeyStore]
> The magic bytes are described here: 
> [https://en.wikipedia.org/wiki/List_of_file_signatures]
>  
> JKS is a proprietary keystore implementation provided by Sun.
> [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation]
>  
> If possible, this should be added to the 2.9.x branch.





[jira] [Resolved] (TIKA-4283) Add detection for JKS Keystore

2024-07-24 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4283.
---
  Assignee: Tilman Hausherr
Resolution: Fixed

Done, it's now in 2.* as well, thanks.

> Add detection for JKS Keystore
> --
>
> Key: TIKA-4283
> URL: https://issues.apache.org/jira/browse/TIKA-4283
> Project: Tika
>  Issue Type: New Feature
>  Components: core, parser
>Affects Versions: 2.9.2
>Reporter: Lonzak
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> I added detection for Java keystores (JKS). It is based on the magic bytes.
>  
> Some additional information:
> [https://en.wikipedia.org/wiki/Java_KeyStore]
> The magic bytes are described here: 
> [https://en.wikipedia.org/wiki/List_of_file_signatures]
>  
> JKS is a proprietary keystore implementation provided by Sun.
> [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation]
>  
> If possible, this should be added to the 2.9.x branch.





[jira] [Updated] (TIKA-4283) Add detection for JKS Keystore

2024-07-24 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4283:
--
Fix Version/s: 3.0.0

> Add detection for JKS Keystore
> --
>
> Key: TIKA-4283
> URL: https://issues.apache.org/jira/browse/TIKA-4283
> Project: Tika
>  Issue Type: New Feature
>Reporter: Lonzak
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> I added detection for Java keystores (JKS). It is based on the magic bytes.
>  
> Some additional information:
> [https://en.wikipedia.org/wiki/Java_KeyStore]
> The magic bytes are described here: 
> [https://en.wikipedia.org/wiki/List_of_file_signatures]
>  
> JKS is a proprietary keystore implementation provided by Sun.
> [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation]
>  
> If possible, this should be added to the 2.9.x branch.





[jira] [Commented] (TIKA-4285) Invalid Link for changelog CHANGES.txt files

2024-07-22 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867684#comment-17867684
 ] 

Tilman Hausherr commented on TIKA-4285:
---

Additionally: the 3.0.0-BETA2 link works, however the text mentions "Tika 
2.9.2".

> Invalid Link for changelog CHANGES.txt files
> 
>
> Key: TIKA-4285
> URL: https://issues.apache.org/jira/browse/TIKA-4285
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.0, 2.9.1, 2.9.2
>Reporter: Lonzak
>Priority: Major
>
> On the Tika start page (https://tika.apache.org/), the linked CHANGES.txt 
> changelog files are missing/broken starting with version 2.9.0.
>  
> Working:
> https://archive.apache.org/dist/tika/2.8.0/CHANGES-2.8.0.txt
> Not working (the "release" path segment was struck through in the original report):
> https://archive.apache.org/dist/release/tika/2.9.0/CHANGES-2.9.0.txt
> https://archive.apache.org/dist/release/tika/2.9.1/CHANGES-2.9.1.txt
> https://archive.apache.org/dist/release/tika/2.9.2/CHANGES-2.9.2.txt





[jira] [Closed] (TIKA-4284) [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13

2024-07-19 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-4284.
-
Resolution: Invalid

> [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and 
> strudl.0.3.13
> ---
>
> Key: TIKA-4284
> URL: https://issues.apache.org/jira/browse/TIKA-4284
> Project: Tika
>  Issue Type: Bug
>Reporter: Abhijit Rajwade
>Priority: Major
>  Labels: SECURITY
>
> CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13
> Description :
> Severity : CVE CVSS 3: 7.5, Sonatype CVSS 3: 7.5
> Weakness :Sonatype CWE: 400
> Source :  National Vulnerability Database
> Categories :  Data
> Description from CVE : An issue was discovered in the stripTags and 
> unescapeHTML components in Prototype 1.7.3 where an attacker can cause a 
> Regular Expression Denial of Service through stripping crafted HTML tags.
> Explanation : The prototype package is vulnerable to Regular Expression 
> Denial of Service (ReDoS) attacks. The stripTags() function in the String.js 
> file used to unescape HTML fails to efficiently parse and remove tags within 
> a given string. An attacker can exploit this vulnerability by submitting a 
> crafted code block which, when parsed by the affected function, will exhaust 
> system resources and trigger a DoS condition.
> Detection :   The application is vulnerable by using this component.
> Recommendation :  There is no non-vulnerable upgrade path for this 
> component/package. We recommend investigating alternative components or a 
> potential mitigating control.
> Root Cause : activemq-osgi-5.17.6.jar org/apache/activemq/web/prototype.js : 
> [ , ]
> Advisories : Attack: https://github.com/AlyxRen/prototype.node.js
> CVSS Details : CVE CVSS 3: 7.5, CVSS Vector: 
> CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
> CVE : CVE-2020-27511
> URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511
> Remediation : This component does not have any non-vulnerable Version. Please 
> contact the vendor to get this vulnerability fixed.
> ===
> Description :
> Severity : CVE CVSS 3: 7.5, Sonatype CVSS 3: 7.5
> Weakness :Sonatype CWE: 400
> Source :  National Vulnerability Database
> Categories :  Data
> Description from CVE : An issue was discovered in the stripTags and 
> unescapeHTML components in Prototype 1.7.3 where an attacker can cause a 
> Regular Expression Denial of Service through stripping crafted HTML tags.
> Explanation : The prototype package is vulnerable to Regular Expression 
> Denial of Service (ReDoS) attacks. The stripTags() function in the String.js 
> file used to unescape HTML fails to efficiently parse and remove tags within 
> a given string. An attacker can exploit this vulnerability by submitting a 
> crafted code block which, when parsed by the affected function, will exhaust 
> system resources and trigger a DoS condition.
> Detection :   The application is vulnerable by using this component.
> Recommendation :  There is no non-vulnerable upgrade path for this 
> component/package. We recommend investigating alternative components or a 
> potential mitigating control.
> Root Cause :  strudl.0.3.13 : [ , ]
> Advisories :  Attack: https://github.com/AlyxRen/prototype.node.js
> CVSS Details : CVE CVSS 3: 7.5, CVSS Vector: 
> CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
> CVE : CVE-2020-27511
> URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511
> Remediation : This component does not have any non-vulnerable Version. Please 
> contact the vendor to get this vulnerability fixed.





[jira] [Commented] (TIKA-4284) [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13

2024-07-19 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867236#comment-17867236
 ] 

Tilman Hausherr commented on TIKA-4284:
---

How is this related to Tika? What subproject uses activemq-osgi-5.17.6 and 
strudl.0.3.13?

> [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and 
> strudl.0.3.13
> ---
>
> Key: TIKA-4284
> URL: https://issues.apache.org/jira/browse/TIKA-4284
> Project: Tika
>  Issue Type: Bug
>Reporter: Abhijit Rajwade
>Priority: Major
>  Labels: SECURITY
>
> CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13
> Description :
> Severity : CVE CVSS 3: 7.5, Sonatype CVSS 3: 7.5
> Weakness :Sonatype CWE: 400
> Source :  National Vulnerability Database
> Categories :  Data
> Description from CVE : An issue was discovered in the stripTags and 
> unescapeHTML components in Prototype 1.7.3 where an attacker can cause a 
> Regular Expression Denial of Service through stripping crafted HTML tags.
> Explanation : The prototype package is vulnerable to Regular Expression 
> Denial of Service (ReDoS) attacks. The stripTags() function in the String.js 
> file used to unescape HTML fails to efficiently parse and remove tags within 
> a given string. An attacker can exploit this vulnerability by submitting a 
> crafted code block which, when parsed by the affected function, will exhaust 
> system resources and trigger a DoS condition.
> Detection :   The application is vulnerable by using this component.
> Recommendation :  There is no non-vulnerable upgrade path for this 
> component/package. We recommend investigating alternative components or a 
> potential mitigating control.
> Root Cause : activemq-osgi-5.17.6.jar org/apache/activemq/web/prototype.js : 
> [ , ]
> Advisories : Attack: https://github.com/AlyxRen/prototype.node.js
> CVSS Details : CVE CVSS 3: 7.5, CVSS Vector: 
> CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
> CVE : CVE-2020-27511
> URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511
> Remediation : This component does not have any non-vulnerable Version. Please 
> contact the vendor to get this vulnerability fixed.
> ===
> Description :
> Severity : CVE CVSS 3: 7.5, Sonatype CVSS 3: 7.5
> Weakness :Sonatype CWE: 400
> Source :  National Vulnerability Database
> Categories :  Data
> Description from CVE : An issue was discovered in the stripTags and 
> unescapeHTML components in Prototype 1.7.3 where an attacker can cause a 
> Regular Expression Denial of Service through stripping crafted HTML tags.
> Explanation : The prototype package is vulnerable to Regular Expression 
> Denial of Service (ReDoS) attacks. The stripTags() function in the String.js 
> file used to unescape HTML fails to efficiently parse and remove tags within 
> a given string. An attacker can exploit this vulnerability by submitting a 
> crafted code block which, when parsed by the affected function, will exhaust 
> system resources and trigger a DoS condition.
> Detection :   The application is vulnerable by using this component.
> Recommendation :  There is no non-vulnerable upgrade path for this 
> component/package. We recommend investigating alternative components or a 
> potential mitigating control.
> Root Cause :  strudl.0.3.13 : [ , ]
> Advisories :  Attack: https://github.com/AlyxRen/prototype.node.js
> CVSS Details : CVE CVSS 3: 7.5, CVSS Vector: 
> CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
> CVE : CVE-2020-27511
> URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511
> Remediation : This component does not have any non-vulnerable Version. Please 
> contact the vendor to get this vulnerability fixed.





[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230

2024-07-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4282:
--
Description: 
The latest h2 version (which needs jdk11) brings a syntax error because of an 
unneeded comma in one SQL query.

release notes:
https://github.com/h2database/h2database/releases/tag/version-2.3.230

likely this:
https://github.com/h2database/h2database/issues/3106

  was:
The latest h2 (which needs jdk11) version brings a syntax error because of an 
unneeded comma in one SQL query.

release notes:
https://github.com/h2database/h2database/releases/tag/version-2.3.230

likely this:
https://github.com/h2database/h2database/issues/3106


> Syntax error with h2 version 2.3.230
> 
>
> Key: TIKA-4282
> URL: https://issues.apache.org/jira/browse/TIKA-4282
> Project: Tika
>  Issue Type: Bug
>  Components: tika-eval
>Affects Versions: 3.0.0-BETA, 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>
> The latest h2 version (which needs jdk11) brings a syntax error because of an 
> unneeded comma in one SQL query.
> release notes:
> https://github.com/h2database/h2database/releases/tag/version-2.3.230
> likely this:
> https://github.com/h2database/h2database/issues/3106





[jira] [Resolved] (TIKA-4282) Syntax error with h2 version 2.3.230

2024-07-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4282.
---
Resolution: Fixed

> Syntax error with h2 version 2.3.230
> 
>
> Key: TIKA-4282
> URL: https://issues.apache.org/jira/browse/TIKA-4282
> Project: Tika
>  Issue Type: Bug
>  Components: tika-eval
>Affects Versions: 3.0.0-BETA, 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>
> The latest h2 (which needs jdk11) version brings a syntax error because of an 
> unneeded comma in one SQL query.
> release notes:
> https://github.com/h2database/h2database/releases/tag/version-2.3.230
> likely this:
> https://github.com/h2database/h2database/issues/3106





[jira] [Created] (TIKA-4282) Syntax error with h2 version 2.3.230

2024-07-17 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4282:
-

 Summary: Syntax error with h2 version 2.3.230
 Key: TIKA-4282
 URL: https://issues.apache.org/jira/browse/TIKA-4282
 Project: Tika
  Issue Type: Bug
  Components: tika-eval
Affects Versions: 3.0.0-BETA
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
 Fix For: 3.0.0


The latest h2 version brings a syntax error because of an unneeded comma in one 
SQL query.

release notes:
https://github.com/h2database/h2database/releases/tag/version-2.3.230

likely this:
https://github.com/h2database/h2database/issues/3106





[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230

2024-07-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4282:
--
Description: 
The latest h2 (which needs jdk11) version brings a syntax error because of an 
unneeded comma in one SQL query.

release notes:
https://github.com/h2database/h2database/releases/tag/version-2.3.230

likely this:
https://github.com/h2database/h2database/issues/3106

  was:
The latest h2 version brings a syntax error because of an unneeded comma in one 
SQL query.

release notes:
https://github.com/h2database/h2database/releases/tag/version-2.3.230

likely this:
https://github.com/h2database/h2database/issues/3106


> Syntax error with h2 version 2.3.230
> 
>
> Key: TIKA-4282
> URL: https://issues.apache.org/jira/browse/TIKA-4282
> Project: Tika
>  Issue Type: Bug
>  Components: tika-eval
>Affects Versions: 3.0.0-BETA
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0
>
>
> The latest h2 (which needs jdk11) version brings a syntax error because of an 
> unneeded comma in one SQL query.
> release notes:
> https://github.com/h2database/h2database/releases/tag/version-2.3.230
> likely this:
> https://github.com/h2database/h2database/issues/3106





[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230

2024-07-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4282:
--
Affects Version/s: 2.9.2

> Syntax error with h2 version 2.3.230
> 
>
> Key: TIKA-4282
> URL: https://issues.apache.org/jira/browse/TIKA-4282
> Project: Tika
>  Issue Type: Bug
>  Components: tika-eval
>Affects Versions: 3.0.0-BETA, 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0
>
>
> The latest h2 (which needs jdk11) version brings a syntax error because of an 
> unneeded comma in one SQL query.
> release notes:
> https://github.com/h2database/h2database/releases/tag/version-2.3.230
> likely this:
> https://github.com/h2database/h2database/issues/3106





[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230

2024-07-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4282:
--
Fix Version/s: 2.9.3

> Syntax error with h2 version 2.3.230
> 
>
> Key: TIKA-4282
> URL: https://issues.apache.org/jira/browse/TIKA-4282
> Project: Tika
>  Issue Type: Bug
>  Components: tika-eval
>Affects Versions: 3.0.0-BETA, 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>
> The latest h2 (which needs jdk11) version brings a syntax error because of an 
> unneeded comma in one SQL query.
> release notes:
> https://github.com/h2database/h2database/releases/tag/version-2.3.230
> likely this:
> https://github.com/h2database/h2database/issues/3106





[jira] [Closed] (TIKA-1155) Number Format is converted with an error

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-1155.
-
Resolution: Cannot Reproduce

Closing because it can no longer be reproduced; it has probably been fixed 
either by us or in POI. Please comment and/or reopen if you disagree.

> Number Format is converted with an error
> 
>
> Key: TIKA-1155
> URL: https://issues.apache.org/jira/browse/TIKA-1155
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Evgeniy Buyanov
>Priority: Major
> Attachments: screenshot-1.png, test-Excel.csv, test.xlsx, test.xml
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> {code:Title=Source data}
> 
><NumberFormat ss:Format="_-* #,##0\ _B_F_-;\-* #,##0\ _B_F_-;_-* 
> &quot;-&quot;\ _B_F_-;_-@_-"/>
> 
> 10
> -10
> {code}
> java -jar tika-app-1.4.jar test.xlsx > test.xml
> {code:Title=Result}
>   * 10 _F
>   -10 _F
> {code}
> related ASF Bugzilla – Bug 
> [52592|https://issues.apache.org/bugzilla/show_bug.cgi?id=52592]





[jira] [Commented] (TIKA-1155) Number Format is converted with an error

2024-07-16 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866408#comment-17866408
 ] 

Tilman Hausherr commented on TIKA-1155:
---

Current output:
{code:xml}
Sheet1
  10
-   10
-
text



{code}
Looks like this on the screen:
 !screenshot-1.png! 

> Number Format is converted with an error
> 
>
> Key: TIKA-1155
> URL: https://issues.apache.org/jira/browse/TIKA-1155
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Evgeniy Buyanov
>Priority: Major
> Attachments: screenshot-1.png, test-Excel.csv, test.xlsx, test.xml
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> {code:Title=Source data}
> 
><NumberFormat ss:Format="_-* #,##0\ _B_F_-;\-* #,##0\ _B_F_-;_-* 
> &quot;-&quot;\ _B_F_-;_-@_-"/>
> 
> 10
> -10
> {code}
> java -jar tika-app-1.4.jar test.xlsx > test.xml
> {code:Title=Result}
>   * 10 _F
>   -10 _F
> {code}
> related ASF Bugzilla – Bug 
> [52592|https://issues.apache.org/bugzilla/show_bug.cgi?id=52592]





[jira] [Updated] (TIKA-1155) Number Format is converted with an error

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1155:
--
Attachment: screenshot-1.png

> Number Format is converted with an error
> 
>
> Key: TIKA-1155
> URL: https://issues.apache.org/jira/browse/TIKA-1155
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Evgeniy Buyanov
>Priority: Major
> Attachments: screenshot-1.png, test-Excel.csv, test.xlsx, test.xml
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> {code:Title=Source data}
> 
><NumberFormat ss:Format="_-* #,##0\ _B_F_-;\-* #,##0\ _B_F_-;_-* 
> &quot;-&quot;\ _B_F_-;_-@_-"/>
> 
> 10
> -10
> {code}
> java -jar tika-app-1.4.jar test.xlsx > test.xml
> {code:Title=Result}
>   * 10 _F
>   -10 _F
> {code}
> related ASF Bugzilla – Bug 
> [52592|https://issues.apache.org/bugzilla/show_bug.cgi?id=52592]





[jira] [Closed] (TIKA-3028) Failed test at SAS7BDATParserTest:112

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-3028.
-
Resolution: Cannot Reproduce

Closing for now because there has been no activity for years; please reopen if 
it still happens. I remember having several problems with a German locale in my 
early months as a committer, and we did some fixes in the code and some 
configuration changes in my IDE.

> Failed test at SAS7BDATParserTest:112
> -
>
> Key: TIKA-3028
> URL: https://issues.apache.org/jira/browse/TIKA-3028
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.23
>Reporter: Wknds
>Priority: Blocker
> Attachments: Bildschirmfoto 2020-01-24 um 23.12.20.png
>
>
> Test fails at 
> SAS7BDATParserTest.testMultiColumns:112->TikaTest.assertContains:107.
> Expected date is _01Jan1960:00:00_
> while the dates in the (untouched) test file are abbreviated by an '.' on my 
> system (please refer to the terminal output below).
> {code:java}
> // code placeholder
> [ERROR] Failures: 
> [ERROR]   
> SAS7BDATParserTest.testMultiColumns:112->TikaTest.assertContains:107 
> 01Jan1960:00:00 not found in:
> TESTING   Record Number   Square of the Record Number Description of 
> the Row  Percent DonePercent Increment   datedatetimetime 
>0   0   This is row0 of   100%  
> 01-01-1960  01Jan.1960:00:00:01.00  00:00:011   1   This 
> is row1 of   1010% 0.0%02-01-1960  
> 01Jan.1960:00:00:10.00  00:00:032   4   This is row   
>  2 of   1020% 50.0%   17-01-1960  
> 01Jan.1960:00:01:40.00  00:00:093   9   This is row   
>  3 of   1030% 66.7%   22-03-1960  
> 01Jan.1960:00:16:40.00  00:00:274   16  This is row   
>  4 of   1040% 75.0%   13-09-1960  
> 01Jan.1960:02:46:40.00  00:01:215   25  This is row   
>  5 of   1050% 80.0%   17-09-1961  
> 02Jan.1960:03:46:40.00  00:04:036   36  This is row   
>  6 of   1060% 83.3%   20-07-1963  
> 12Jan.1960:13:46:40.00  00:12:097   49  This is row   
>  7 of   1070% 85.7%   29-07-1966  
> 25Apr.1960:17:46:40.00  00:36:278   64  This is row   
>  8 of   1080% 87.5%   20-03-1971  
> 03März1963:09:46:40.00  01:49:219   81  This is row   
>  9 of   1090% 88.9%   18-12-1977  
> 09Sep.1991:01:46:40.00  05:28:0310  100 This is row   
> 10 of   10100%90.0%   19-05-1987  
> 19Nov.2276:17:46:40.00  16:24:09
> {code}
>  
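
The locale dependence described above can be reproduced with a short Java sketch. This is not the code path used by the SAS7BDAT parser or its test. The pattern string and the class name are assumptions chosen to mimic the expected value 01Jan1960:00:00. The German result is deliberately not asserted, because the month abbreviation (for example with or without a trailing dot) depends on the locale data shipped with the JDK.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical demo class; it only illustrates locale-sensitive month names.
class LocaleDateSketch {

    // Assumed pattern that yields values shaped like 01Jan1960:00:00.
    private static final String PATTERN = "ddMMMyyyy:HH:mm";

    // Formats the value with an explicit locale instead of the JVM default.
    static String format(LocalDateTime value, Locale locale) {
        return DateTimeFormatter.ofPattern(PATTERN, locale).format(value);
    }

    public static void main(String[] args) {
        LocalDateTime epoch = LocalDateTime.of(1960, 1, 1, 0, 0);
        System.out.println(format(epoch, Locale.ENGLISH)); // 01Jan1960:00:00
        // The German abbreviation may differ, depending on the JDK's locale data.
        System.out.println(format(epoch, Locale.GERMAN));
    }
}
```

Passing a fixed locale to the formatter, rather than relying on the default locale of the machine, is one common way to keep such test expectations stable across systems.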





[jira] [Updated] (TIKA-3290) Extension reading it as eml instead of txt

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-3290:
--
Fix Version/s: (was: 1.24.1)

> Extension reading it as eml instead of txt
> --
>
> Key: TIKA-3290
> URL: https://issues.apache.org/jira/browse/TIKA-3290
> Project: Tika
>  Issue Type: Bug
>  Components: core, mime
>Affects Versions: 1.25
>Reporter: Tika User
>Priority: Major
>  Labels: tika-parsers
> Attachments: image-2021-02-22-10-13-08-447.png, 
> image-2021-02-23-12-39-00-778.png, test_sample_message.txt
>
>
> The attached file is detected as eml instead of txt. With version 1.24.1 it 
> was detected as txt; after the upgrade to 1.25 it is detected as eml, so we 
> get a mail corrupted error while parsing.





[jira] [Resolved] (TIKA-3172) PDF Parser configuration enable auto space using tika config file

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-3172.
---
Fix Version/s: 1.25
 Assignee: Tim Allison
   Resolution: Fixed

> PDF Parser configuration enable auto space using tika config file
> -
>
> Key: TIKA-3172
> URL: https://issues.apache.org/jira/browse/TIKA-3172
> Project: Tika
>  Issue Type: Wish
>  Components: parser
>Affects Versions: 1.24.1
>Reporter: Akash
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.25
>
>
> Need information on how to set enableAutoSpace using tika config file.
> {code:java}
> /
>   
> 
>   
> 
> 
>   
> false
>   
> 
>   
> / 
> {code}
> Above configuration is not working.





[jira] [Closed] (TIKA-3155) Parse Error while extracting CSV files

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-3155.
-
Resolution: Duplicate

Closing as a duplicate of TIKA-4278. According to the improved logic, this isn't a CSV file.

> Parse Error while extracting CSV files
> --
>
> Key: TIKA-3155
> URL: https://issues.apache.org/jira/browse/TIKA-3155
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.24.1
>Reporter: Akash
>Priority: Major
> Attachments: UTF-8_chars.csv
>
>
> We are getting a parse error while trying to extract CSV files.
> This worked in version 1.9, but an exception is thrown in 1.24.1.
>  
> {code:java}
> /Exception in thread "main" org.apache.tika.exception.TikaException: 
> exception parsing the csv
>   at 
> org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:198 
> undefined)
>   at 
> org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 
> undefined)
>   at 
> org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 
> undefined)
>   at 
> org.apache.tikar.AutoDetectParser.parse.parse(AutoDetectParser.java:143 
> undefined)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209 
> undefined)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496 undefined)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149 undefined)
> Caused by: java.lang.IllegalStateException: IOException reading next record: 
> java.io.IOException: (startline 39) EOF reached before encapsulated token 
> finished
>   at 
> org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145
>  undefined)
>   at 
> org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155 
> undefined)
>   at 
> org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:178 
> undefined)
>   ... 6 more
> Caused by: java.io.IOException: (startline 39) EOF reached before 
> encapsulated token finished
>   at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288 
> undefined)
>   at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158 undefined)
>   at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674 
> undefined)
>   at 
> org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142
>  undefined)/ 
> {code}
> The issue occurs when we encounter double quotes in one of the cells.





[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-16 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866277#comment-17866277
 ] 

Tilman Hausherr commented on TIKA-4278:
---

If colon and another delimiter have been detected with the same confidence, use 
the other one.

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.
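
As a rough sketch of the idea discussed in this issue, the Java example below tries a set of candidate delimiters and, when the colon ties with another delimiter, prefers the other one. The class, the method names and the simple consistency score are assumptions made for illustration. Tika's CSVSniffer computes its confidence differently, and that code is not shown in this thread.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; not Tika's CSVSniffer.
class DelimiterSniffSketch {

    // Candidate delimiters named in the issue: comma, tab, pipe, semicolon, colon.
    static final Set<Character> CANDIDATES =
            new LinkedHashSet<>(Arrays.asList(',', '\t', '|', ';', ':'));

    // Counts how often the delimiter occurs in one line.
    static int count(String line, char delimiter) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == delimiter) {
                n++;
            }
        }
        return n;
    }

    // Assumed score: number of lines with the same non-zero count as the first line.
    static int score(String[] lines, char delimiter) {
        int expected = count(lines[0], delimiter);
        if (expected == 0) {
            return 0;
        }
        int consistent = 0;
        for (String line : lines) {
            if (count(line, delimiter) == expected) {
                consistent++;
            }
        }
        return consistent;
    }

    // Returns the best-scoring delimiter, or null if no candidate occurs.
    static Character sniff(String text) {
        String[] lines = text.split("\\R");
        Character best = null;
        int bestScore = 0;
        for (char candidate : CANDIDATES) {
            int s = score(lines, candidate);
            if (s == 0) {
                continue;
            }
            // On a tie, a non-colon candidate replaces a colon, never the reverse.
            boolean tieAgainstColon = s == bestScore && best != null && best == ':';
            if (s > bestScore || tieAgainstColon) {
                best = candidate;
                bestScore = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(sniff("a;b;c\n1;2;3"));      // ;
        System.out.println(sniff("10:00;1\n11:00;2"));  // ; (colon ties, semicolon wins)
    }
}
```

The tie-break matters for data such as timestamps, where colons appear once per line just as regularly as the real separator.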





[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4278:
--
Attachment: reports_csv_2.9.2_vs_2.9.3_4.tar.xz

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-15 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866147#comment-17866147
 ] 

Tilman Hausherr edited comment on TIKA-4278 at 7/15/24 6:40 PM:


I've now added a check that if the delimiter isn't in row zero then further 
hits later don't count. This fixes the problem that too many files are 
recognized as CSV that are not.

Only one problem left now: false colon-separated lines. I never had any in 
decades, but a google search does find some SO questions, so I'll leave that 
there for now. We can still change it after the "big" regression tests.


was (Author: tilman):
I've now added a check that if the delimiter isn't in row zero then further 
hits later don't count. This fixes the problem that too many files are 
recognized as CSV that are not.

Only one problem left now: colon-separated lines. I never had any in decades, 
but a google search does find some SO questions, so I'll leave that there for 
now. We can still change it after the "big" regression tests.

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-15 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866147#comment-17866147
 ] 

Tilman Hausherr edited comment on TIKA-4278 at 7/15/24 6:24 PM:


I've now added a check that if the delimiter isn't in row zero then further 
hits later don't count. This fixes the problem that too many files are 
recognized as CSV that are not.

Only one problem left now: colon-separated lines. I never had any in decades, 
but a google search does find some SO questions, so I'll leave that there for 
now. We can still change it after the "big" regression tests.


was (Author: tilman):
I've now added a check that if the delimiter isn't in row zero then further 
hits later don't count. This fixes the problem that too many files are 
recognized as CSV that are not.

Only one problem left now: colon-separated lines. I never had any in decades, 
but a google search does find some SO questions, so I'll leave that there for 
now.

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-15 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866147#comment-17866147
 ] 

Tilman Hausherr commented on TIKA-4278:
---

I've now added a check that if the delimiter isn't in row zero then further 
hits later don't count. This fixes the problem that too many files are 
recognized as CSV that are not.

Only one problem left now: colon-separated lines. I never had any in decades, 
but a google search does find some SO questions, so I'll leave that there for 
now.

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-15 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4278:
--
Attachment: reports_csv_2.9.2_vs_2.9.3_3.tar.xz

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-15 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866059#comment-17866059
 ] 

Tilman Hausherr commented on TIKA-4278:
---

Many files are now detected as CSV although they are not, e.g. 
govdocs1/040/040251.txt.

govdocs1/242/242970.txt and govdocs1/001/001605.txt now get ":" as the 
separator although it is obviously ",". Maybe because of TIME_HH:MM:SS?!

govdocs1/346/346152.txt is considered pipe-separated although it is a plain 
text file that merely contains a table. IMHO it shouldn't "detect" something 
that isn't in the first line. This would also solve the problem with 
govdocs1/040/040251.txt.

govdocs1/113/113291.txt: the output claims that it contains "컴컴" but it 
doesn't. I assume this comes from a different change than mine because my 
changes aren't related to the encoding.

I'll rerun the tests with a change that returns 0 confidence in CSVSniffer when 
the delimiter is not in row zero.
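
A rough illustration of the row-zero rule, assuming a toy confidence measure (the fraction of lines that contain the delimiter). {{RowZeroCheck}} and {{confidence}} are hypothetical names; the real {{CSVSniffer}} scoring is different and also has to deal with quoting, which this sketch ignores.

```java
/** Hypothetical sketch, not Tika code: zero confidence unless row zero has the delimiter. */
final class RowZeroCheck {

    /**
     * Toy confidence: the fraction of lines containing the delimiter, but 0.0
     * when the first line (row zero) does not contain it.
     */
    static double confidence(String text, char delimiter) {
        String[] rows = text.split("\r?\n");
        if (rows.length == 0 || rows[0].indexOf(delimiter) < 0) {
            // Hits in later rows don't count if row zero has none.
            return 0.0;
        }
        int hits = 0;
        for (String row : rows) {
            if (row.indexOf(delimiter) >= 0) {
                hits++;
            }
        }
        return (double) hits / rows.length;
    }

    public static void main(String[] args) {
        // A text file with a heading followed by a pipe-separated table.
        System.out.println(confidence("Some heading\na|b\nc|d", '|'));
        // A semicolon-separated file whose first row already uses the delimiter.
        System.out.println(confidence("a;b\nc;d", ';'));
    }
}
```

With this rule, a text file that merely contains a table further down is no longer scored as CSV for that delimiter.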

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-15 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4278:
--
Attachment: reports_csv_2.9.2_vs_2.9.3.tar.xz

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Resolved] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-15 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4278.
---
Fix Version/s: 3.0.0
   2.9.3
 Assignee: Tilman Hausherr
   Resolution: Fixed

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-15 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4278:
--
Labels: csv csvparser  (was: )

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-15 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865884#comment-17865884
 ] 

Tilman Hausherr commented on TIKA-4278:
---

The next build did work.

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Priority: Major
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-14 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4278:
--
Description: 
I ran the code from the attached SO issue and yes it doesn't detect semicolon 
separated files. The reason is this line in {{TextAndCSVParser.java}}:
{code:java}
private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
{code}
This is later used by {{CSVSniffer}}. For some reason the other delimiters 
(pipe, colon and semicolon) aren't in that array, although they are in 
{{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now it 
works for semicolon.

Can I change this by adding the missing delimiters or was there a reason that I 
missed? Proposed change would change CSVSniffer so that delimiters is a set and 
then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.

  was:
I ran the code from the attached SO issue and yes it doesn't detect semicolon 
separated files. The reason is this line in {{TextAndCSVParser.java}}:
{code:java}
private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
{code}
This is later uses by {{CSVSniffer}}. For some reason the other delimiters 
(pipe, colon and semicolon) aren't in that array, although they are in 
{{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now it 
works for semicolon.

Can I change this by adding the missing delimiters or was there a reason that I 
missed? Proposed change would change CSVSniffer so that delimiters is a set and 
then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.


> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Priority: Major
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-13 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4278:
--
Description: 
I ran the code from the attached SO issue and yes it doesn't detect semicolon 
separated files. The reason is this line in {{TextAndCSVParser.java}}:
{code:java}
private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
{code}
This is later used by {{CSVSniffer}}. For some reason the other delimiters 
(pipe, colon and semicolon) aren't in that array, although they are in 
{{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now it 
works for semicolon.

Can I change this by adding the missing delimiters or was there a reason that I 
missed? Proposed change would change CSVSniffer so that delimiters is a set and 
then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.

  was:
I ran the code from the attached SO issue and yes it doesn't detect semicolon 
separated files. The reason is this line in {{TextAndCSVParser.java}}:
{code:java}
private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
{code}
This is later uses by {{CSVSniffer}}. For some reason the other delimiters 
(pipe, colon and semicolon) aren't in that array, although they are in 
{{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now it 
works for semicolon.

Can I change this by adding the missing delimiters or was there a reason that I 
missed?


> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Priority: Major
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Commented] (TIKA-2833) Add a CSV/TSV detector

2024-07-13 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865679#comment-17865679
 ] 

Tilman Hausherr commented on TIKA-2833:
---

[~joshm] please create a new ticket. Alternatively use {{TextAndCSVParser}} 
which can detect some csv files but not all, see TIKA-4278.

> Add a CSV/TSV detector
> --
>
> Key: TIKA-2833
> URL: https://issues.apache.org/jira/browse/TIKA-2833
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.21
>
> Attachments: csv_reports.zip
>
>
> Given initial experimentation, I think we can fairly easily add a fairly 
> robust CSV/TSV detector that will identify well-formed (ha!) csvs and return 
> the charset encoding and the delimiter.





[jira] [Updated] (TIKA-2833) Add a CSV/TSV detector

2024-07-13 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-2833:
--
Fix Version/s: 1.21

> Add a CSV/TSV detector
> --
>
> Key: TIKA-2833
> URL: https://issues.apache.org/jira/browse/TIKA-2833
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.21
>
> Attachments: csv_reports.zip
>
>
> Given initial experimentation, I think we can fairly easily add a fairly 
> robust CSV/TSV detector that will identify well-formed (ha!) csvs and return 
> the charset encoding and the delimiter.





[jira] [Created] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-13 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4278:
-

 Summary: TextAndCSVParser doesn't detect semicolon separated file
 Key: TIKA-4278
 URL: https://issues.apache.org/jira/browse/TIKA-4278
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 2.9.2
Reporter: Tilman Hausherr


I ran the code from the attached SO issue and yes it doesn't detect semicolon 
separated files. The reason is this line in {{TextAndCSVParser.java}}:
{code:java}
private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
{code}
This is later used by {{CSVSniffer}}. For some reason the other delimiters 
(pipe, colon and semicolon) aren't in that array, although they are in 
{{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now it 
works for semicolon.

Can I change this by adding the missing delimiters or was there a reason that I 
missed?





Re: [VOTE] Release Apache Tika 3.0.0-BETA2 Candidate #1

2024-07-12 Thread Tilman Hausherr

+1

built on windows 10 jdk11

Before releasing the real 3.0.0 we need to remove any "-M" dependencies 
(I've added these so we support these other projects by testing them), 
and decide about the ffmpeg issue and the hdf5 issue.


Tilman

On 12.07.2024 18:08, Tim Allison wrote:

A candidate for the Tika 3.0.0-BETA2 release is available at:
https://dist.apache.org/repos/dist/dev/tika/3.0.0-BETA2

The release candidate is a zip archive of the sources in:
https://github.com/apache/tika/tree/3.0.0-BETA2-rc1/

The SHA-512 checksum of the archive is
8a4142f61110f196c550146637994d26f66d6c798fc9e1d18dcadcb8a8fe817a52f59f3a03341809131f59b644fa2e183212bdee5f292d3d603d1a5a893c6848.

In addition, a staged maven repository is available here:
https://repository.apache.org/content/repositories/orgapachetika-1105/org/apache/tika

Please vote on releasing this package as Apache Tika 3.0.0-BETA2.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 3.0.0-BETA2
[ ] -1 Do not release this package because...

Here's my +1.

Thank you, all!

Best,

  Tim
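
For anyone checking the release candidate, the SHA-512 value quoted in the vote mail can be compared against a downloaded archive along these lines. This is a minimal sketch using only the JDK; {{Sha512Check}} is a hypothetical helper, and it reads the whole file into memory, which is fine for a source zip but not ideal for very large files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: compares a file's SHA-512 digest with an expected hex string. */
final class Sha512Check {

    /** Returns the lowercase hex SHA-512 digest of the given bytes. */
    static String sha512Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-512");
            StringBuilder sb = new StringBuilder(128);
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-512 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** True if the digest of the data equals the expected hex value, ignoring case. */
    static boolean matches(byte[] data, String expectedHex) {
        return sha512Hex(data).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Sha512Check <archive> <expected-sha512-hex>
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(matches(bytes, args[1]) ? "checksum OK" : "checksum MISMATCH");
    }
}
```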





[jira] [Commented] (TIKA-4277) PDF parse issue for text rotated

2024-07-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865340#comment-17865340
 ] 

Tilman Hausherr commented on TIKA-4277:
---

Done.

> PDF parse issue for text rotated
> 
>
> Key: TIKA-4277
> URL: https://issues.apache.org/jira/browse/TIKA-4277
> Project: Tika
>  Issue Type: Bug
>  Components: tika-app, tika-server
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: ragebear
>Priority: Major
>  Labels: config.xml
> Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result.
> The attached PDF cannot be parsed correctly by Tika 2.9.2 and 3.0beta, in 
> either the server or the standalone version.
> If the text is rotated by 90 degrees, the parsed result has a line break after 
> each letter of a word. This happens with symbols, English letters, and CJK 
> characters.
> In the server version: curl -g -T "sample2.pdf" 
> [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version: java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" 
> --text
> Both of the above deliver the incorrect result for the attached PDF.
> The output is shown below:
> i
> n
> s
> e
> r
> t
>  
> t
> e
> x
> t
>  
> p
> r
> o
> b
> l
> e
> m
> insert text problem





[jira] [Updated] (TIKA-4277) PDF parse issue for text rotated

2024-07-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4277:
--
Labels: config.xml  (was: )

> PDF parse issue for text rotated
> 
>
> Key: TIKA-4277
> URL: https://issues.apache.org/jira/browse/TIKA-4277
> Project: Tika
>  Issue Type: Bug
>  Components: tika-app, tika-server
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: ragebear
>Priority: Major
>  Labels: config.xml
> Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result.
> The attached PDF cannot be parsed correctly by Tika 2.9.2 and 3.0beta, in 
> either the server or the standalone version.
> If the text is rotated by 90 degrees, the parsed result has a line break after 
> each letter of a word. This happens with symbols, English letters, and CJK 
> characters.
> In the server version: curl -g -T "sample2.pdf" 
> [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version: java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" 
> --text
> Both of the above deliver the incorrect result for the attached PDF.
> The output is shown below:
> i
> n
> s
> e
> r
> t
>  
> t
> e
> x
> t
>  
> p
> r
> o
> b
> l
> e
> m
> insert text problem





[jira] [Closed] (TIKA-4277) PDF parse issue for text rotated

2024-07-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-4277.
-
Resolution: Duplicate

> PDF parse issue for text rotated
> 
>
> Key: TIKA-4277
> URL: https://issues.apache.org/jira/browse/TIKA-4277
> Project: Tika
>  Issue Type: Bug
>  Components: tika-app, tika-server
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: ragebear
>Priority: Major
>  Labels: config.xml
> Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result.
> The attached PDF cannot be parsed correctly by Tika 2.9.2 and 3.0beta, in 
> either the server or the standalone version.
> If the text is rotated by 90 degrees, the parsed result has a line break after 
> each letter of a word. This happens with symbols, English letters, and CJK 
> characters.
> In the server version: curl -g -T "sample2.pdf" 
> [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version: java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" 
> --text
> Both of the above deliver the incorrect result for the attached PDF.
> The output is shown below:
> i
> n
> s
> e
> r
> t
>  
> t
> e
> x
> t
>  
> p
> r
> o
> b
> l
> e
> m
> insert text problem





[jira] [Commented] (TIKA-4277) PDF parse issue for text rotated

2024-07-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865297#comment-17865297
 ] 

Tilman Hausherr commented on TIKA-4277:
---

To see what parameters are available and how to use them, do this:
{noformat}
java -jar tika-app-VERSION.jar --config=config.xml --dump-current-config
{noformat}
I get this:
{code:xml}


  
  
  

  
  
  

  
  


  
true
0.3
true
true
2.5
true
true
false
true
true
false
false
false
false
false
true
false
NONE
10
536870912
300
png
1.0
GRAY
ALL
AUTO
10,10
false
false
true
0.5
false
false
  
  

  

{code}

> PDF parse issue for text rotated
> 
>
> Key: TIKA-4277
> URL: https://issues.apache.org/jira/browse/TIKA-4277
> Project: Tika
>  Issue Type: Bug
>  Components: tika-app, tika-server
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: ragebear
>Priority: Major
> Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result.
> The attached PDF cannot be correctly parsed by Tika 2.9.2 and 3.0beta, in 
> either the server version or the standalone version.
> If the text is rotated 90 degrees, the parsed result has a line break after 
> each letter of a word. This happens with symbols, English letters, and CJK 
> characters.
> In the server version, curl -g -T "sample2.pdf" 
> [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" 
> --text
> Both of the above deliver the incorrect result for the attached PDF.
> The output is below:
> i
> n
> s
> e
> r
> t
>  
> t
> e
> x
> t
>  
> p
> r
> o
> b
> l
> e
> m
> insert text problem





[jira] [Closed] (TIKA-4276) Tika fails to detect damaged pdf

2024-07-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-4276.
-
Resolution: Not A Bug

> Tika fails to detect damaged pdf
> 
>
> Key: TIKA-4276
> URL: https://issues.apache.org/jira/browse/TIKA-4276
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Xiaohong Yang
>Priority: Major
>
> We use Tika to check file type and extension. However, Tika detects some 
> damaged PDF files as text files.
> Could Tika detect such a damaged PDF file as a PDF, with the matching file 
> type and extension?
> The sample code follows. The tika-config.xml and the sample PDF file are 
> available at 
> [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es]
> The operating system is Ubuntu 20.04, the Java version is 21, the Tika 
> version is 2.9.2, and the POI version is 5.2.3.
>  
>  
> {code:java}
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.mime.MimeType;
>
> import java.io.FileInputStream;
>
> public class DetectDamagedPDF {
>
>     public static void main(String args[]) {
>         try {
>             String filePath =
>                     "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf";
>             TikaConfig config = new TikaConfig(
>                     "/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml");
>             Detector detector = config.getDetector();
>             Metadata metadata = new Metadata();
>             FileInputStream fis = new FileInputStream(filePath);
>             TikaInputStream stream = TikaInputStream.get(fis);
>             metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath);
>             MediaType mediaType = detector.detect(stream, metadata);
>             MimeType mimeType =
>                     config.getMimeRepository().forName(mediaType.toString());
>             String tikaExtension = mimeType.getExtension();
>             System.out.println("tikaExtension = " + tikaExtension);
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         }
>     }
> }
> {code}
>  





[jira] [Comment Edited] (TIKA-4277) PDF parse issue for text rotated

2024-07-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865143#comment-17865143
 ] 

Tilman Hausherr edited comment on TIKA-4277 at 7/11/24 7:10 PM:


You should add / integrate something like this:
{code:xml}








true




{code}


was (Author: tilman):
You should add / integrate something like this:
{code:xml}








true




{code}

> PDF parse issue for text rotated
> 
>
> Key: TIKA-4277
> URL: https://issues.apache.org/jira/browse/TIKA-4277
> Project: Tika
>  Issue Type: Bug
>  Components: tika-app, tika-server
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: ragebear
>Priority: Major
> Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result.
> The attached PDF cannot be correctly parsed by Tika 2.9.2 and 3.0beta, in 
> either the server version or the standalone version.
> If the text is rotated 90 degrees, the parsed result has a line break after 
> each letter of a word. This happens with symbols, English letters, and CJK 
> characters.
> In the server version, curl -g -T "sample2.pdf" 
> [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" 
> --text
> Both of the above deliver the incorrect result for the attached PDF.
> The output is below:
> i
> n
> s
> e
> r
> t
>  
> t
> e
> x
> t
>  
> p
> r
> o
> b
> l
> e
> m
> insert text problem





[jira] [Comment Edited] (TIKA-4277) PDF parse issue for text rotated

2024-07-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865143#comment-17865143
 ] 

Tilman Hausherr edited comment on TIKA-4277 at 7/11/24 7:00 PM:


You should add / integrate something like this:
{code:xml}








true




{code}


was (Author: tilman):
You should add / integrate something like this:
{code:xml}







true




{code}

> PDF parse issue for text rotated
> 
>
> Key: TIKA-4277
> URL: https://issues.apache.org/jira/browse/TIKA-4277
> Project: Tika
>  Issue Type: Bug
>  Components: tika-app, tika-server
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: ragebear
>Priority: Major
> Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result.
> The attached PDF cannot be correctly parsed by Tika 2.9.2 and 3.0beta, in 
> either the server version or the standalone version.
> If the text is rotated 90 degrees, the parsed result has a line break after 
> each letter of a word. This happens with symbols, English letters, and CJK 
> characters.
> In the server version, curl -g -T "sample2.pdf" 
> [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" 
> --text
> Both of the above deliver the incorrect result for the attached PDF.
> The output is below:
> i
> n
> s
> e
> r
> t
>  
> t
> e
> x
> t
>  
> p
> r
> o
> b
> l
> e
> m
> insert text problem





[jira] [Comment Edited] (TIKA-4277) PDF parse issue for text rotated

2024-07-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865143#comment-17865143
 ] 

Tilman Hausherr edited comment on TIKA-4277 at 7/11/24 3:53 PM:


You should add / integrate something like this:
{code:xml}







true




{code}


was (Author: tilman):
You should add / integrate something like this:
{code:xml}

    
        
            
                true
            
        
    

{code}

> PDF parse issue for text rotated
> 
>
> Key: TIKA-4277
> URL: https://issues.apache.org/jira/browse/TIKA-4277
> Project: Tika
>  Issue Type: Bug
>  Components: tika-app, tika-server
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: ragebear
>Priority: Major
> Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result.
> The attached PDF cannot be correctly parsed by Tika 2.9.2 and 3.0beta, in 
> either the server version or the standalone version.
> If the text is rotated 90 degrees, the parsed result has a line break after 
> each letter of a word. This happens with symbols, English letters, and CJK 
> characters.
> In the server version, curl -g -T "sample2.pdf" 
> [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" 
> --text
> Both of the above deliver the incorrect result for the attached PDF.
> The output is below:
> i
> n
> s
> e
> r
> t
>  
> t
> e
> x
> t
>  
> p
> r
> o
> b
> l
> e
> m
> insert text problem





[jira] [Commented] (TIKA-4277) PDF parse issue for text rotated

2024-07-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865143#comment-17865143
 ] 

Tilman Hausherr commented on TIKA-4277:
---

You should add / integrate something like this:
{code:xml}

    
        
            
                true
            
        
    

{code}

> PDF parse issue for text rotated
> 
>
> Key: TIKA-4277
> URL: https://issues.apache.org/jira/browse/TIKA-4277
> Project: Tika
>  Issue Type: Bug
>  Components: tika-app, tika-server
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: ragebear
>Priority: Major
> Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result.
> The attached PDF cannot be correctly parsed by Tika 2.9.2 and 3.0beta, in 
> either the server version or the standalone version.
> If the text is rotated 90 degrees, the parsed result has a line break after 
> each letter of a word. This happens with symbols, English letters, and CJK 
> characters.
> In the server version, curl -g -T "sample2.pdf" 
> [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" 
> --text
> Both of the above deliver the incorrect result for the attached PDF.
> The output is below:
> i
> n
> s
> e
> r
> t
>  
> t
> e
> x
> t
>  
> p
> r
> o
> b
> l
> e
> m
> insert text problem





[jira] [Comment Edited] (TIKA-4277) PDF parse issue for text rotated

2024-07-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865142#comment-17865142
 ] 

Tilman Hausherr edited comment on TIKA-4277 at 7/11/24 3:48 PM:


Please attach your config.xml, or are you using default settings?


was (Author: tilman):
Please attach your config.xml

> PDF parse issue for text rotated
> 
>
> Key: TIKA-4277
> URL: https://issues.apache.org/jira/browse/TIKA-4277
> Project: Tika
>  Issue Type: Bug
>  Components: tika-app, tika-server
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: ragebear
>Priority: Major
> Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result.
> The attached PDF cannot be correctly parsed by Tika 2.9.2 and 3.0beta, in 
> either the server version or the standalone version.
> If the text is rotated 90 degrees, the parsed result has a line break after 
> each letter of a word. This happens with symbols, English letters, and CJK 
> characters.
> In the server version, curl -g -T "sample2.pdf" 
> [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" 
> --text
> Both of the above deliver the incorrect result for the attached PDF.
> The output is below:
> i
> n
> s
> e
> r
> t
>  
> t
> e
> x
> t
>  
> p
> r
> o
> b
> l
> e
> m
> insert text problem





[jira] [Commented] (TIKA-4277) PDF parse issue for text rotated

2024-07-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17865142#comment-17865142
 ] 

Tilman Hausherr commented on TIKA-4277:
---

Please attach your config.xml

> PDF parse issue for text rotated
> 
>
> Key: TIKA-4277
> URL: https://issues.apache.org/jira/browse/TIKA-4277
> Project: Tika
>  Issue Type: Bug
>  Components: tika-app, tika-server
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: ragebear
>Priority: Major
> Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Tika and Tika Server 2.9.2 and 3.0beta produce an incorrect parse result.
> The attached PDF cannot be correctly parsed by Tika 2.9.2 and 3.0beta, in 
> either the server version or the standalone version.
> If the text is rotated 90 degrees, the parsed result has a line break after 
> each letter of a word. This happens with symbols, English letters, and CJK 
> characters.
> In the server version, curl -g -T "sample2.pdf" 
> [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" 
> --text
> Both of the above deliver the incorrect result for the attached PDF.
> The output is below:
> i
> n
> s
> e
> r
> t
>  
> t
> e
> x
> t
>  
> p
> r
> o
> b
> l
> e
> m
> insert text problem





[jira] [Closed] (TIKA-723) Rotated text isn't extracted correctly from PDFs

2024-07-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-723.

Resolution: Duplicate

Duplicate of TIKA-2779

> Rotated text isn't extracted correctly from PDFs
> 
>
> Key: TIKA-723
> URL: https://issues.apache.org/jira/browse/TIKA-723
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
>Priority: Minor
> Attachments: rotated.pdf
>
>
> I have an example PDF with 90 degree rotation; Tika produces the
> characters one line at a time.  I.e., the doc has "Some rotated text,
> here!" but Tika produces this:
> {noformat}
> So
> m
> e
>  
> r
> o
> t
> a
> t
> e
> d
>  
> t
> e
> x
> t
> ,
>  
> h
> e
> r
> e
> !
> {noformat}
> I'm able to copy/paste the text out correctly.





[jira] [Updated] (TIKA-4276) Tika fails to detect damaged pdf

2024-07-10 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4276:
--
Description: 
We use Tika to check file type and extension. However, Tika detects some 
damaged PDF files as text files.

Could Tika detect such a damaged PDF file as a PDF, with the matching file 
type and extension?

The sample code follows. The tika-config.xml and the sample PDF file are 
available at [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es]

The operating system is Ubuntu 20.04, the Java version is 21, the Tika 
version is 2.9.2, and the POI version is 5.2.3.

 

 
{code:java}
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeType;

import java.io.FileInputStream;

public class DetectDamagedPDF {

    public static void main(String args[]) {
        try {
            String filePath =
                    "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf";
            TikaConfig config = new TikaConfig(
                    "/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml");
            Detector detector = config.getDetector();
            Metadata metadata = new Metadata();
            FileInputStream fis = new FileInputStream(filePath);
            TikaInputStream stream = TikaInputStream.get(fis);
            metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath);
            MediaType mediaType = detector.detect(stream, metadata);
            MimeType mimeType =
                    config.getMimeRepository().forName(mediaType.toString());
            String tikaExtension = mimeType.getExtension();
            System.out.println("tikaExtension = " + tikaExtension);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
{code}
 

  was:
We use Tika to check file type and extension. However, Tika detects some 
damaged PDF files as text files.

Could Tika detect such a damaged PDF file as a PDF, with the matching file 
type and extension?

The sample code follows. The tika-config.xml and the sample PDF file are 
available at [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es]

The operating system is Ubuntu 20.04, the Java version is 21, the Tika 
version is 2.9.2, and the POI version is 5.2.3.

 

 

import org.apache.tika.config.TikaConfig;

import org.apache.tika.detect.Detector;

import org.apache.tika.io.TikaInputStream;

import org.apache.tika.metadata.Metadata;

import org.apache.tika.metadata.TikaCoreProperties;

import org.apache.tika.mime.MediaType;

import org.apache.tika.mime.MimeType;

 

import java.io.FileInputStream;

 

public class DetectDamagedPDF {

 

    public static void main(String args[]) {

    try

{     String filePath = 
"/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf";     
TikaConfig config = new 
TikaConfig("/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml");    
 Detector detector = config.getDetector();     Metadata metadata = 
new Metadata();     FileInputStream fis = new 
FileInputStream(filePath);     TikaInputStream stream = 
TikaInputStream.get(fis);     
metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath);     
MediaType mediaType = detector.detect(stream, metadata);     MimeType 
mimeType = config.getMimeRepository().forName(mediaType.toString());    
 String tikaExtension = mimeType.getExtension();     
System.out.println("tikaExtension = " + tikaExtension);     }

    catch(Exception ex)

{     ex.printStackTrace();     }

    }

}

 


> Tika fails to detect damaged pdf
> 
>
> Key: TIKA-4276
> URL: https://issues.apache.org/jira/browse/TIKA-4276
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Xiaohong Yang
>Priority: Major
>
> We use Tika to check file type and extension. However, Tika detects some 
> damaged PDF files as text files.
> Could Tika detect such a damaged PDF file as a PDF, with the matching file 
> type and extension?
> The sample code follows. The tika-config.xml and the sample PDF file are 
> available at 
> [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es]
> The operating system is Ubuntu 20.04, the Java version is 21, the Tika 
> version is 2.9.2, and the POI version is 5.2.3.
>  
>  
> {code:java}
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.io.TikaInputStream;
> impo

[jira] [Commented] (TIKA-4276) Tika fails to detect damaged pdf

2024-07-10 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17864670#comment-17864670
 ] 

Tilman Hausherr commented on TIKA-4276:
---

Your file starts with "1 0 obj" instead of with "%PDF" so I'd say this isn't a 
bug. The file is truncated at the beginning, and it could be truncated 
anywhere. We'd need countless magic numbers.
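
To illustrate the point about the header, here is a minimal sketch in plain Java of a check for the "%PDF" signature at offset 0. It is not Tika's detection code: the class and method names (PdfHeaderCheck, startsWithPdfHeader) are hypothetical, and the only assumption taken from the comment above is that an intact file begins with "%PDF".

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration only; this is not Tika's detection code.
final class PdfHeaderCheck {

    // Signature expected at offset 0 of an intact PDF file.
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    private PdfHeaderCheck() {
    }

    // True only if the leading bytes are exactly the "%PDF" signature.
    static boolean startsWithPdfHeader(byte[] head) {
        if (head.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, PDF_MAGIC.length), PDF_MAGIC);
    }

    // Reads just enough bytes from the start of the file to test the signature.
    static boolean startsWithPdfHeader(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return startsWithPdfHeader(in.readNBytes(PDF_MAGIC.length));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java PdfHeaderCheck <file>");
            return;
        }
        Path file = Path.of(args[0]);
        System.out.println(file + " starts with %PDF: " + startsWithPdfHeader(file));
    }
}
```

A file that begins with "1 0 obj" fails this check. Because truncation can happen at any offset, no finite list of such signatures would cover every damaged file.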

> Tika fails to detect damaged pdf
> 
>
> Key: TIKA-4276
> URL: https://issues.apache.org/jira/browse/TIKA-4276
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Xiaohong Yang
>Priority: Major
>
> We use Tika to check file type and extension. However, Tika detects some 
> damaged PDF files as text files.
> Could Tika detect such a damaged PDF file as a PDF, with the matching file 
> type and extension?
> The sample code follows. The tika-config.xml and the sample PDF file are 
> available at 
> [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es]
> The operating system is Ubuntu 20.04, the Java version is 21, the Tika 
> version is 2.9.2, and the POI version is 5.2.3.
>  
>  
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.mime.MimeType;
>  
> import java.io.FileInputStream;
>  
> public class DetectDamagedPDF {
>  
>     public static void main(String args[]) {
>     try
> {     String filePath = 
> "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf";     
> TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml");  
>    Detector detector = config.getDetector();     Metadata 
> metadata = new Metadata();     FileInputStream fis = new 
> FileInputStream(filePath);     TikaInputStream stream = 
> TikaInputStream.get(fis);     
> metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath);     
> MediaType mediaType = detector.detect(stream, metadata);     MimeType 
> mimeType = config.getMimeRepository().forName(mediaType.toString());  
>    String tikaExtension = mimeType.getExtension();     
> System.out.println("tikaExtension = " + tikaExtension);     }
>     catch(Exception ex)
> {     ex.printStackTrace();     }
>     }
> }
>  





[jira] [Resolved] (TIKA-4274) Improve ExtractReaderException

2024-07-07 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4274.
---
Resolution: Fixed

> Improve ExtractReaderException
> --
>
> Key: TIKA-4274
> URL: https://issues.apache.org/jira/browse/TIKA-4274
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-eval
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>
> I saw this stack trace in the eval log and it's not really helpful
> {noformat}
> org.apache.tika.eval.app.io.ExtractReaderException
>   at 
> org.apache.tika.eval.app.io.ExtractReader.loadExtract(ExtractReader.java:125)
>   at 
> org.apache.tika.eval.app.ExtractComparer.compareFiles(ExtractComparer.java:198)
>   at 
> org.apache.tika.eval.app.ExtractComparer.processFileResource(ExtractComparer.java:180)
>   at 
> org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:152)
>   at 
> org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:87)
>   at 
> org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}
> so I'm adding the type, the cause and also some logging for 
> EXTRACT_FILE_TOO_SHORT / EXTRACT_FILE_TOO_LONG so that we can know what this 
> is about, and then do something (or not) about it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4274) Improve ExtractReaderException

2024-07-07 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17863552#comment-17863552
 ] 

Tilman Hausherr commented on TIKA-4274:
---

new output:
{noformat}
INFO  [pool-3-thread-4] 11:41:41,973 org.apache.tika.eval.app.io.ExtractReader 
maxExtractLength 200 > IGNORE_LENGTH -1 and length 2587452 > 
maxExtractLength 200
org.apache.tika.eval.app.io.ExtractReaderException: EXTRACT_FILE_TOO_LONG
at 
org.apache.tika.eval.app.io.ExtractReader.loadExtract(ExtractReader.java:129)
at 
org.apache.tika.eval.app.ExtractComparer.compareFiles(ExtractComparer.java:198)
at 
org.apache.tika.eval.app.ExtractComparer.processFileResource(ExtractComparer.java:180)
at 
org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:152)
at 
org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:87)
at 
org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}
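
The idea behind the new output (carry the failure type in the exception and log the lengths before throwing) can be sketched roughly as follows. This is not the actual tika-eval code: TypedExtractException, ExtractLengthCheck and checkLength are hypothetical names, and the comparisons against IGNORE_LENGTH are only inferred from the log line above.

```java
import java.util.logging.Logger;

// Hypothetical exception type; it mirrors the idea of exposing the failure type
// as the exception message so that it shows up in the stack trace.
final class TypedExtractException extends Exception {

    enum Type { EXTRACT_FILE_TOO_SHORT, EXTRACT_FILE_TOO_LONG }

    private final Type type;

    TypedExtractException(Type type) {
        super(type.name());
        this.type = type;
    }

    // Keeps the underlying cause when there is one.
    TypedExtractException(Type type, Throwable cause) {
        super(type.name(), cause);
        this.type = type;
    }

    Type getType() {
        return type;
    }
}

// Hypothetical length check, assuming -1 means "no limit" as in the log line.
final class ExtractLengthCheck {

    static final long IGNORE_LENGTH = -1;

    private static final Logger LOG = Logger.getLogger(ExtractLengthCheck.class.getName());

    private ExtractLengthCheck() {
    }

    static void checkLength(long length, long minExtractLength, long maxExtractLength)
            throws TypedExtractException {
        if (minExtractLength > IGNORE_LENGTH && length < minExtractLength) {
            // Log the numbers first, so the reader can see why the extract was rejected.
            LOG.info("length " + length + " < minExtractLength " + minExtractLength);
            throw new TypedExtractException(TypedExtractException.Type.EXTRACT_FILE_TOO_SHORT);
        }
        if (maxExtractLength > IGNORE_LENGTH && length > maxExtractLength) {
            LOG.info("length " + length + " > maxExtractLength " + maxExtractLength);
            throw new TypedExtractException(TypedExtractException.Type.EXTRACT_FILE_TOO_LONG);
        }
    }
}
```

With the type passed to the superclass constructor as the message, the first line of a printed stack trace names the reason instead of showing only the bare exception class.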

> Improve ExtractReaderException
> --
>
> Key: TIKA-4274
> URL: https://issues.apache.org/jira/browse/TIKA-4274
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-eval
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>
> I saw this stack trace in the eval log and it's not really helpful
> {noformat}
> org.apache.tika.eval.app.io.ExtractReaderException
>   at 
> org.apache.tika.eval.app.io.ExtractReader.loadExtract(ExtractReader.java:125)
>   at 
> org.apache.tika.eval.app.ExtractComparer.compareFiles(ExtractComparer.java:198)
>   at 
> org.apache.tika.eval.app.ExtractComparer.processFileResource(ExtractComparer.java:180)
>   at 
> org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:152)
>   at 
> org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:87)
>   at 
> org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}
> so I'm adding the type, the cause and also some logging for 
> EXTRACT_FILE_TOO_SHORT / EXTRACT_FILE_TOO_LONG so that we can know what this 
> is about, and then do something (or not) about it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4274) Improve ExtractReaderException

2024-07-07 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4274:
-

 Summary: Improve ExtractReaderException
 Key: TIKA-4274
 URL: https://issues.apache.org/jira/browse/TIKA-4274
 Project: Tika
  Issue Type: Improvement
  Components: tika-eval
Affects Versions: 2.9.2
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
 Fix For: 3.0.0, 2.9.3


I saw this stack trace in the eval log and it's not really helpful
{noformat}
org.apache.tika.eval.app.io.ExtractReaderException
at 
org.apache.tika.eval.app.io.ExtractReader.loadExtract(ExtractReader.java:125)
at 
org.apache.tika.eval.app.ExtractComparer.compareFiles(ExtractComparer.java:198)
at 
org.apache.tika.eval.app.ExtractComparer.processFileResource(ExtractComparer.java:180)
at 
org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:152)
at 
org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:87)
at 
org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}
so I'm adding the type, the cause and also some logging for 
EXTRACT_FILE_TOO_SHORT / EXTRACT_FILE_TOO_LONG so that we can know what this is 
about, and then do something (or not) about it.





jdk22 build

2024-07-06 Thread Tilman Hausherr
I've set up a jdk22 build (renamed an older one). However, some tests 
fail; I've opened


https://issues.apache.org/jira/browse/INFRA-25943

Tilman




Re: 3.0.0-BETA2 next week?

2024-07-03 Thread Tilman Hausherr

Hi,

Sure... there's currently a CVE problem with tika-dl (Deep Learning) 
related to ffmpeg version "6.1.1-1.5.10". I got rid of it by excluding 
ffmpeg and the tests still work. Is tika-dl meant to use videos too? 
Apparently yes: https://github.com/apache/tika/pull/165


Tilman

On 03.07.2024 22:03, Tim Allison wrote:

All,
   I think it is time to go for a 3.0.0-BETA2. What do you think about
cutting that release this Friday or maybe next week?

   Best,

  Tim





[jira] [Commented] (TIKA-4181) Tika Grpc Server using Tika Pipes

2024-07-02 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17861555#comment-17861555
 ] 

Tilman Hausherr commented on TIKA-4181:
---

PR 1849 has now succeeded.

> Tika Grpc Server using Tika Pipes
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Create a Tika Grpc server.
> You should be able to create Tika Pipes fetchers, then use those fetchers. 
> You can then use those fetchers to FetchAndParse in 3 ways:
>  * synchronous fashion - you send a single request to fetch a file, and get a 
> single FetchAndParse response tuple.
>  * streaming output - you send a single request and stream back the 
> FetchAndParse response tuple.
>  * bi-directional streaming - You stream in 1 or more Fetch requests and 
> stream back FetchAndParse response tuples.
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4181) Tika Grpc Server using Tika Pipes

2024-07-02 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17861363#comment-17861363
 ] 

Tilman Hausherr commented on TIKA-4181:
---

As a first step I've updated protobuf to current in the grpc subproject and 
excluded a dependency. We'll see what else will succeed. If there's anything 
that stops working but isn't shown by the tests please revert and add a comment 
in the pom.xml.

> Tika Grpc Server using Tika Pipes
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Create a Tika Grpc server.
> You should be able to create Tika Pipes fetchers, then use those fetchers. 
> You can then use those fetchers to FetchAndParse in 3 ways:
>  * synchronous fashion - you send a single request to fetch a file, and get a 
> single FetchAndParse response tuple.
>  * streaming output - you send a single request and stream back the 
> FetchAndParse response tuple.
>  * bi-directional streaming - You stream in 1 or more Fetch requests and 
> stream back FetchAndParse response tuples.
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!





[jira] [Comment Edited] (TIKA-4181) Tika Grpc Server using Tika Pipes

2024-07-01 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17861075#comment-17861075
 ] 

Tilman Hausherr edited comment on TIKA-4181 at 7/1/24 7:02 AM:
---

Is this
{code:xml}
3.24.0
3.24.0
{code}
intended? This is an older version of protobuf than the one we're using. It's 
also preventing this PR from working:
https://github.com/apache/tika/pull/1849

{noformat}
2024-07-01T06:17:39.8130959Z [WARNING] Rule 0: 
org.apache.maven.plugins.enforcer.DependencyConvergence failed with message:
2024-07-01T06:17:39.8132291Z Failed while enforcing releasability the error(s) 
are [
2024-07-01T06:17:39.8133867Z Dependency convergence error for 
com.google.protobuf:protobuf-java-util:3.25.1 paths to dependency are:
2024-07-01T06:17:39.8135252Z +-org.apache.tika:tika-grpc:3.0.0-SNAPSHOT
2024-07-01T06:17:39.8136080Z   +-io.grpc:grpc-services:1.65.0
2024-07-01T06:17:39.8136947Z +-com.google.protobuf:protobuf-java-util:3.25.1
2024-07-01T06:17:39.8137737Z and
2024-07-01T06:17:39.8138366Z +-org.apache.tika:tika-grpc:3.0.0-SNAPSHOT
2024-07-01T06:17:39.8139307Z   +-com.google.protobuf:protobuf-java-util:3.24.0
{noformat}



was (Author: tilman):
Is this
{code:xml}
3.24.0
3.24.0
{code}
intended? This is an older version of protobuf than the one we're using. It's 
also preventing this PR from working:
https://github.com/apache/tika/pull/1849

> Tika Grpc Server using Tika Pipes
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Create a Tika Grpc server.
> You should be able to create Tika Pipes fetchers, then use those fetchers. 
> You can then use those fetchers to FetchAndParse in 3 ways:
>  * synchronous fashion - you send a single request to fetch a file, and get a 
> single FetchAndParse response tuple.
>  * streaming output - you send a single request and stream back the 
> FetchAndParse response tuple.
>  * bi-directional streaming - You stream in 1 or more Fetch requests and 
> stream back FetchAndParse response tuples.
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!





[jira] [Commented] (TIKA-4181) Tika Grpc Server using Tika Pipes

2024-07-01 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17861075#comment-17861075
 ] 

Tilman Hausherr commented on TIKA-4181:
---

Is this
{code:xml}
3.24.0
3.24.0
{code}
intended? This is an older version of protobuf than the one we're using. It's 
also preventing this PR from working:
https://github.com/apache/tika/pull/1849

> Tika Grpc Server using Tika Pipes
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Create a Tika Grpc server.
> You should be able to create Tika Pipes fetchers, then use those fetchers. 
> You can then use those fetchers to FetchAndParse in 3 ways:
>  * synchronous fashion - you send a single request to fetch a file, and get a 
> single FetchAndParse response tuple.
>  * streaming output - you send a single request and stream back the 
> FetchAndParse response tuple.
>  * bi-directional streaming - You stream in 1 or more Fetch requests and 
> stream back FetchAndParse response tuples.
> Requires we create a service contract that specifies the inputs we require 
> from each method.
> Then we will need to implement the different components with a grpc client 
> generated using the contract.
> This would enable developers to run tika-pipes as a persistently running 
> daemon instead of just a single batch app, because it can continue to stream 
> out more inputs.
> !image-2024-02-06-07-54-50-116.png!





[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-24 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859718#comment-17859718
 ] 

Tilman Hausherr commented on TIKA-4251:
---

I'm wondering if this means lots of changes to check at the beginning. This is 
the kind of plugin that would be ideal for a supply chain attack.

> [DISCUSS] move to cosium's git-code-format-maven-plugin with 
> google-java-format
> ---
>
> Key: TIKA-4251
> URL: https://issues.apache.org/jira/browse/TIKA-4251
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I was recently working a bit on incubator-stormcrawler, and I noticed that 
> they are using cosium's git-code-format-maven-plugin: 
> https://github.com/Cosium/git-code-format-maven-plugin
> I was initially annoyed that I couldn't quickly figure out what I had to fix 
> to make the linter happy, but then I realized there was a magic command: 
> {{mvn git-code-format:format-code}} which just fixed the code so that the 
> linter passed. 
> The one drawback I found is that it does not fix nor does it alert on 
> wildcard imports.  We could still use checkstyle for that but only have one 
> rule for checkstyle.
> The other drawback is that there is not a lot of room for variation from 
> google's style. This may actually be a benefit, too, of course.
> I just ran this on {{tika-core}} here: 
> https://github.com/apache/tika/tree/google-java-format
> What would you think about making this change for 3.x?





[jira] [Updated] (TIKA-4270) wrong skew angle in tika-parser-ocr-module

2024-06-20 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4270:
--
Description: 
We use Tika to extract text from different sources, including images with text 
that is rotated at a certain angle. To extract text from an image with OCR, Tika 
first deskews the image. The skew angle is not calculated correctly. In the 
example [^for_issue] (PNG file), the text is rotated at an angle of ~40 degrees, 
but the skew angle function 
(org.apache.tika.parser.ocr.tess4j.ImageDeskew#getSkewAngle) returns an angle 
of about 15. The slope angle calculation flag is enabled.

The documentation 
(https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#:~:text=To%20identify%20rotation)
 does not have sufficient information for this version of Tika: there is a TODO 
box and some relevant information for Tika 1 (which requires Python and its 
libraries, whereas in the version of Tika we use, angle calculations are 
implemented in Java only).

  was:
We use Tika to extract text from different sources, including images with text 
that is rotated at a certain angle. To extract text from an image with OCR, Tika 
first deskews the image. The skew angle is not calculated correctly. In the 
example [^for_issue], the text is rotated at an angle of ~40 degrees, but the 
skew angle function (org.apache.tika.parser.ocr.tess4j.ImageDeskew#getSkewAngle) 
returns an angle of about 15. The slope angle calculation flag is enabled.

The documentation 
(https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#:~:text=To%20identify%20rotation)
 does not have sufficient information for this version of Tika: there is a TODO 
box and some relevant information for Tika 1 (which requires Python and its 
libraries, whereas in the version of Tika we use, angle calculations are 
implemented in Java only).


> wrong skew angle in tika-parser-ocr-module
> --
>
> Key: TIKA-4270
> URL: https://issues.apache.org/jira/browse/TIKA-4270
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.1
>Reporter: Roman
>Priority: Major
> Attachments: for_issue
>
>
> We use Tika to extract text from different sources, including images with 
> text that is rotated at a certain angle. To extract text from an image with 
> OCR, Tika first deskews the image. The skew angle is not calculated correctly. 
> In the example [^for_issue] (PNG file), the text is rotated at an angle of ~40 
> degrees, but the skew angle function 
> (org.apache.tika.parser.ocr.tess4j.ImageDeskew#getSkewAngle) returns an angle 
> of about 15. The slope angle calculation flag is enabled.
> The documentation 
> (https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#:~:text=To%20identify%20rotation)
>  does not have sufficient information for this version of Tika: there is a 
> TODO box and some relevant information for Tika 1 (which requires Python and 
> its libraries, whereas in the version of Tika we use, angle calculations are 
> implemented in Java only).





[jira] [Closed] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv

2024-06-10 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-4267.
-
Resolution: Invalid

Closing for now, please comment and/or reopen if needed.

> Not getting correct mime type for a few file extensions. example: csv
> -
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.28.4
>Reporter: niv
>Priority: Major
>
> Mime type for CSV files is always incorrectly detected as text/plain.
> Using the method {{detector.detect(stream, metadata);}}
> Jar file used: Tika 1.28.4
> How can I get the correct mime type in a Java application?
> Please point me to a list of the extensions currently supported by the latest 
> jar, with their mime types, if there is one.





[jira] [Updated] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv

2024-06-03 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4267:
--
Summary: Not getting correct mime type for a few file extensions. example: 
csv  (was: Not getting correct mimet type for few file extensions. example :csv)

> Not getting correct mime type for a few file extensions. example: csv
> -
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.28.4
>Reporter: niv
>Priority: Major
>
> Mime type for CSV files is always incorrectly detected as text/plain.
> Using the method {{detector.detect(stream, metadata);}}
> Jar file used: Tika 1.28.4
> How can I get the correct mime type in a Java application?
> Please point me to a list of the extensions currently supported by the latest 
> jar, with their mime types, if there is one.





[jira] [Updated] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv

2024-06-03 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4267:
--
Affects Version/s: 1.28.4

> Not getting correct mimet type for few file extensions. example :csv
> 
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.28.4
>Reporter: niv
>Priority: Major
>
> Mime type for CSV files is always incorrectly detected as text/plain.
> Using the method {{detector.detect(stream, metadata);}}
> Jar file used: Tika 1.28.4
> How can I get the correct mime type in a Java application?
> Please point me to a list of the extensions currently supported by the latest 
> jar, with their mime types, if there is one.





[jira] [Comment Edited] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv

2024-06-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598
 ] 

Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:06 PM:


The current version is 2.9.2, please retry with that one.

Get the list of parsers with this code:
{code:java}
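import java.util.Map;
import org.apache.tika.Tika;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;

AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Maps each supported media type to the parser that handles it
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet())
{
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}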
{code}


was (Author: tilman):
The current version is 2.9.2, please retry with that one.

> Not getting correct mimet type for few file extensions. example :csv
> 
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Reporter: niv
>Priority: Major
>
> Mime type for CSV files is always incorrectly detected as text/plain.
> Using the method {{detector.detect(stream, metadata);}}
> Jar file used: Tika 1.28.4
> How can I get the correct mime type in a Java application?
> Please point me to a list of the extensions currently supported by the latest 
> jar, with their mime types, if there is one.





[jira] [Comment Edited] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv

2024-06-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598
 ] 

Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:07 PM:


The current version is 2.9.2, please retry with that one; if it still doesn't 
work, please attach your csv file.

Get the list of parsers with this code:
{code:java}
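import java.util.Map;
import org.apache.tika.Tika;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;

AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Maps each supported media type to the parser that handles it
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet())
{
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}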
{code}
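
One possible reason for the text/plain result, offered as an assumption and not something confirmed in this thread: a CSV file has no magic bytes, so sniffing the content alone cannot tell it apart from other plain text, and a file name hint is needed to refine the guess. The sketch below only illustrates that idea with plain JDK code. NameHintDetector, its two-entry extension table and its method names are hypothetical and are not Tika's API or implementation:

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical two-stage detector; names and the tiny extension table are illustrative only.
final class NameHintDetector {
    private static final Map<String, String> EXTENSION_TYPES = Map.of(
            "csv", "text/csv",
            "tsv", "text/tab-separated-values");

    // Content-only guess: text bytes carry no signature, so the best answer is text/plain.
    static String detectFromContent(byte[] prefix) {
        for (byte b : prefix) {
            if (b == 0) {
                return "application/octet-stream"; // a NUL byte suggests binary data
            }
        }
        return "text/plain";
    }

    // Refines a text/plain guess with the file extension, if a file name is available.
    static String detect(byte[] prefix, String fileName) {
        String byContent = detectFromContent(prefix);
        if (fileName == null || !byContent.equals("text/plain")) {
            return byContent;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return byContent;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSION_TYPES.getOrDefault(ext, byContent);
    }

    public static void main(String[] args) {
        byte[] csv = "a,b\n1,2\n".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(detect(csv, null));       // no name hint
        System.out.println(detect(csv, "data.csv")); // name hint available
    }
}
```

With Tika itself, the file name hint would travel in the Metadata object passed to detect(), so it may be worth checking whether the resource name is set there before the call.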


was (Author: tilman):
The current version is 2.9.2, please retry with that one.

Get the list of parsers with this code:
{code:java}
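import java.util.Map;
import org.apache.tika.Tika;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;

AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Maps each supported media type to the parser that handles it
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet())
{
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}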
{code}

> Not getting correct mime type for a few file extensions. example: csv
> -
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.28.4
>Reporter: niv
>Priority: Major
>
> Mime type for CSV files is always incorrectly detected as text/plain.
> Using the method {{detector.detect(stream, metadata);}}
> Jar file used: Tika 1.28.4
> How can I get the correct mime type in a Java application?
> Please point me to a list of the extensions currently supported by the latest 
> jar, with their mime types, if there is one.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv

2024-06-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598
 ] 

Tilman Hausherr commented on TIKA-4267:
---

The current version is 2.9.2, please retry with that one.

> Not getting correct mimet type for few file extensions. example :csv
> 
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Reporter: niv
>Priority: Major
>
> Mime type for CSV files is always incorrectly detected as text/plain.
> Using the method {{detector.detect(stream, metadata);}}
> Jar file used: Tika 1.28.4
> How can I get the correct mime type in a Java application?
> Please point me to a list of the extensions currently supported by the latest 
> jar, with their mime types, if there is one.





[jira] [Updated] (TIKA-1907) Big Pdf parsing to text - Out of memory

2024-05-15 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1907:
--
Fix Version/s: 3.0.0

> Big Pdf parsing to text - Out of memory
> ---
>
> Key: TIKA-1907
> URL: https://issues.apache.org/jira/browse/TIKA-1907
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Nicolas Daniels
>Priority: Major
> Fix For: 3.0.0
>
>
> Linked to PDFBox issue: [https://issues.apache.org/jira/browse/PDFBOX-3284]
> I'm duplicating it here to make sure it will be fixed in Tika as well. Maybe 
> PDFBox is not the appropriate lib to use in such case.
> Trying to read the same PDF using Tika leads to the same problem:
> {code:title=Test.java|borderStyle=solid}
> @Test
> public void testParsePdf_Content_Memory() throws Exception {
>     // Stream the extracted text straight to a file; both resources are closed automatically
>     try (InputStream inputStream =
>                  new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
>          FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"))) {
>         BodyContentHandler handler = new BodyContentHandler(fileWriter);
>         Metadata metadata = new Metadata();
>         new PDFParser().parse(inputStream, handler, metadata, new ParseContext());
>     }
> }
> {code}





[jira] [Comment Edited] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.

2024-05-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845590#comment-17845590
 ] 

Tilman Hausherr edited comment on TIKA-4254 at 5/12/24 9:40 AM:


THausherr commented on PR #1754:
URL: https://github.com/apache/tika/pull/1754#issuecomment-2105679546

   Maybe I get it: {{repo = config.getMimeRepository();}} isn't creating 
anything new, it's retrieving something that is changed later by the test? If 
my understanding is correct then it's a deeper problem.
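
If that reading is right, the failure can be reproduced without Tika at all. The following is a minimal sketch under that assumption: PatternRegistry is a hypothetical stand-in for the shared repository (not the real MimeTypes class), and its rule "same pattern claimed by a different type object means conflict" is taken from the issue description, not from the Tika source:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a shared pattern repository; not Tika's MimeTypes class.
final class PatternRegistry {
    private final Map<String, Object> typeByPattern = new HashMap<>();

    // Registers a pattern; a different type object claiming the same pattern is a conflict.
    void addPattern(Object type, String pattern) {
        Object existing = typeByPattern.putIfAbsent(pattern, type);
        if (existing != null && existing != type) {
            throw new IllegalStateException("Conflicting glob pattern: " + pattern);
        }
    }
}

final class IdempotencySketch {
    // Mimics one execution of the test body against the given registry.
    static boolean runOnce(PatternRegistry repo, Object type) {
        try {
            repo.addPattern(type, "rtg_sst_grb_0\\.5\\.\\d{8}");
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        PatternRegistry shared = new PatternRegistry();
        // Each run creates a new type object but reuses the shared registry.
        System.out.println(runOnce(shared, new Object())); // first run registers the pattern
        System.out.println(runOnce(shared, new Object())); // second run hits the leftover state
        // A fresh registry per run has no leftover state.
        System.out.println(runOnce(new PatternRegistry(), new Object()));
    }
}
```

Giving each run a fresh registry, or reusing one type object as the issue proposes, both avoid the conflict in this sketch; which of the two suits the real test is a separate question.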





was (Author: githubbot):
THausherr commented on PR #1754:
URL: https://github.com/apache/tika/pull/1754#issuecomment-2105679546

   Maybe I get it: `repo = config.getMimeRepository();` isn't creating anything 
new, it's retrieving something that is changed later by the test? If my 
understanding is correct then it's a deeper problem.




> The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the 
> first run and fails in repeated runs in the same environment. 
> 
>
> Key: TIKA-4254
> URL: https://issues.apache.org/jira/browse/TIKA-4254
> Project: Tika
>  Issue Type: Bug
>Reporter: Kaiyao Ke
>Priority: Major
>
> ### Brief Description of the Bug
> The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the 
> first run but fails in the second run in the same environment. The source of 
> the problem is that each test execution initializes a new media type 
> (`MimeType`) instance `testType` (same problem for `testType2`), and all 
> media types across different test executions attempt to use the same name 
> pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of 
> the test, the line `this.repo.addPattern(testType, pattern, true);` will 
> throw an error, since the name pattern is already used by the `testType` 
> instance initiated from the first test execution. Specifically, in the second 
> run, the `addGlob()` method of the `Patterns` class will detect the conflicting 
> pattern and throw a `MimeTypeException` (line 123 in `Patterns.java`).
> ### Failure Message in the 2nd Test Run:
> ```
> org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: 
> rtg_sst_grb_0\.5\.\d{8}
>   at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
>   at org.apache.tika.mime.Patterns.add(Patterns.java:71)
>   at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
>   at 
> org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> ```
> ### Reproduce
> Use the `NIOInspector` plugin that supports rerunning individual tests in the 
> same environment:
> ```
> cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
> mvn edu.illinois:NIOInspector:rerun 
> -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex
> ```
> ### Proposed Fix
> Declare `testType` and `testType2` as static variables and initialize them at 
> class loading time. Therefore, repeated runs of `testJavaRegex()` will not 
> conflict with each other. All tests pass and are idempotent after the fix.
> ### Necessity of Fix
> A fix is recommended as unit tests shall be idempotent, and state pollution 
> shall be mitigated so that newly introduced tests do not fail in the future 
> due to polluted shared states.





[jira] [Commented] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.

2024-05-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845566#comment-17845566
 ] 

Tilman Hausherr commented on TIKA-4254:
---

Why would we ever run the test twice in the same environment?

> The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the 
> first run and fails in repeated runs in the same environment. 
> 
>
> Key: TIKA-4254
> URL: https://issues.apache.org/jira/browse/TIKA-4254
> Project: Tika
>  Issue Type: Bug
>Reporter: Kaiyao Ke
>Priority: Major
>
> ### Brief Description of the Bug
> The test `TestMimeTypes#testJavaRegex` is non-idempotent, as it passes in the 
> first run but fails in the second run in the same environment. The source of 
> the problem is that each test execution initializes a new media type 
> (`MimeType`) instance `testType` (same problem for `testType2`), and all 
> media types across different test executions attempt to use the same name 
> pattern `"rtg_sst_grb_0\\.5\\.\\d{8}"`. Therefore, in the second execution of 
> the test, the line `this.repo.addPattern(testType, pattern, true);` will 
> throw an error, since the name pattern is already used by the `testType` 
> instance initiated from the first test execution. Specifically, in the second 
> run, the `addGlob()` method of the `Patterns` class will detect the conflicting 
> pattern and throw a `MimeTypeException` (line 123 in `Patterns.java`).
> ### Failure Message in the 2nd Test Run:
> ```
> org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: 
> rtg_sst_grb_0\.5\.\d{8}
>   at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
>   at org.apache.tika.mime.Patterns.add(Patterns.java:71)
>   at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
>   at 
> org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> ```
> ### Reproduce
> Use the `NIOInspector` plugin that supports rerunning individual tests in the 
> same environment:
> ```
> cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
> mvn edu.illinois:NIOInspector:rerun 
> -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex
> ```
> ### Proposed Fix
> Declare `testType` and `testType2` as static variables and initialize them at 
> class loading time. Therefore, repeated runs of `testJavaRegex()` will not 
> conflict with each other. All tests pass and are idempotent after the fix.
> ### Necessity of Fix
> A fix is recommended as unit tests shall be idempotent, and state pollution 
> shall be mitigated so that newly introduced tests do not fail in the future 
> due to polluted shared states.





Re: Bump dependabot to weekly?

2024-04-29 Thread Tilman Hausherr

Yes!

Tilman

On 29.04.2024 16:55, Tim Allison wrote:

Oh, interesting. Should we bump this value to, say, 20?


https://docs.github.com/en/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file#open-pull-requests-limit
?

Thank you, Tilman!

On Mon, Apr 29, 2024 at 10:47 AM Tilman Hausherr
wrote:


The positive side is that it's less interruptions.
One negative side is that there seems to be a maximum. Today it didn't
report the AWS update, which was detected in the past.
Tilman




Re: Bump dependabot to weekly?

2024-04-29 Thread Tilman Hausherr

The positive side is that it's less interruptions.
One negative side is that there seems to be a maximum. Today it didn't 
report the AWS update, which was detected in the past.

Tilman

On 29.04.2024 16:34, Tim Allison wrote:

The move to weekly dependabot has been a bit of a relief for me personally.
Our mail list isn't clogged w daily dependabot updates (and yes, I know I
can apply a filter :/).

How is it working for everyone else?

On Wed, Apr 10, 2024 at 4:09 PM Tim Allison  wrote:


you start deleting them reflexively out of your email!

Not Tilman!!!

Let's move to weekly and see how that works?

On Wed, Apr 10, 2024 at 3:57 PM Eric Pugh
 wrote:

Hence why I like the monthly unless it’s a special case… The flood of
updates just means you start deleting them reflexively out of your email!
Now, if you have a dependency and you’re maybe actively working on it, and
it’s changing quickly, then that might be an argument for daily.

On Apr 10, 2024, at 12:53 PM, Tilman Hausherr wrote:

I'm fine with daily because this way we can learn ASAP if there are
troubles with new dependency versions, although I'm now too busy.

Tilman



-- Original Message --
From: Tim Allison 
Subject: Bump dependabot to weekly?
Date: 10.04.2024, 18:08
To:  

All,
  Tilman has been doing heroic work keeping us up to date with
dependabot's PRs. Given our pace of releases, would it make sense to
backoff to weekly updates?
  Before running regression tests, we'd run the update plugin to make
sure that we're up to date.
  What do you think?

Best,

 Tim







[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-25 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840922#comment-17840922
 ] 

Tilman Hausherr commented on TIKA-4245:
---

The file claims to be utf-16 but it isn't. If I change it to utf-8 in the 
editor then I get an NPE in the GUI.
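
The garbled output in the report is consistent with that: decoding single-byte markup as UTF-16 pairs every two bytes into one code unit, and most of those land in the CJK range. A small JDK-only sketch of the effect (the class and method names are made up for illustration, and big-endian byte order is an assumption):

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows what happens when ASCII markup is read as UTF-16.
final class Utf16Mojibake {
    // Decodes ASCII text as UTF-16BE, so each pair of bytes becomes one UTF-16 code unit.
    static String misdecode(String asciiText) {
        byte[] bytes = asciiText.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // '<' is 0x3C and 'h' is 0x68, so the first code unit is U+3C68, and so on.
        System.out.println(misdecode("<html "));
    }
}
```

For the six-byte input "<html " this yields the code units U+3C68, U+746D and U+6C20, which appear to be the same three characters that open the garbled text quoted in the report.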

> Tika does not get html content properly 
> 
>
> Key: TIKA-4245
> URL: https://issues.apache.org/jira/browse/TIKA-4245
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: Sample html file and tika config xml.zip
>
>
> We use org.apache.tika.parser.AutoDetectParser to get the content of html 
> files.  And we found out that it does not get the content of the sample file 
> properly.
> Following is the sample code and attached is the tika-config.xml and the 
> sample html file.  The content extracted with Tika reads 
> "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
> from the native file.
>  
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.2.   
>  {code:java}
> import org.apache.commons.io.FileUtils;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
>  
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintWriter;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>  
> public class ExtractTxtFromHtml {
> private static final Path inputFile = new 
> File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();
>  
> public static void main(String args[]) {
> extactText(false);
> extactText(true);
> }
>  
> static void extactText(boolean largeFile) {
> PrintWriter outputFileWriter = null;
> try {
> BodyContentHandler handler;
> Path outputFilePath = null;
>  
> if (largeFile) {
> // write tika output to disk
> outputFilePath = 
> Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
> outputFileWriter = new 
> PrintWriter(Files.newOutputStream(outputFilePath));
> handler = new BodyContentHandler(outputFileWriter);
> } else {
> // stream it in memory
> handler = new BodyContentHandler(-1);
> }
>  
> Metadata metadata = new Metadata();
> FileInputStream inputData = new 
> FileInputStream(inputFile.toFile());
> TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
> Parser autoDetectParser = new AutoDetectParser(config);
> ParseContext context = new ParseContext();
> context.set(TikaConfig.class, config);
> autoDetectParser.parse(inputData, handler, metadata, context);
>  
> String content;
> if (largeFile) {
> content = FileUtils.readFileToString(outputFilePath.toFile());
> }
> else {
> content = handler.toString();
> }
> System.out.println("content = " + content);
> }
> catch(Exception ex) {
> ex.printStackTrace();
> } finally {
> if (outputFileWriter != null) {
> outputFileWriter.close();
> }
> }
> }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-25 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840908#comment-17840908
 ] 

Tilman Hausherr commented on TIKA-4245:
---

Happens also with the tika app GUI.

> Tika does not get html content properly 
> 
>
> Key: TIKA-4245
> URL: https://issues.apache.org/jira/browse/TIKA-4245
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: Sample html file and tika config xml.zip
>
>
> We use org.apache.tika.parser.AutoDetectParser to get the content of html 
> files.  And we found out that it does not get the content of the sample file 
> properly.
> Following is the sample code and attached is the tika-config.xml and the 
> sample html file.  The content extracted with Tika reads 
> "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
> from the native file.
>  
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.2.   
>  {code:java}
> import org.apache.commons.io.FileUtils;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
>  
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintWriter;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>  
> public class ExtractTxtFromHtml {
> private static final Path inputFile = new 
> File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();
>  
> public static void main(String args[]) {
> extactText(false);
> extactText(true);
> }
>  
> static void extactText(boolean largeFile) {
> PrintWriter outputFileWriter = null;
> try {
> BodyContentHandler handler;
> Path outputFilePath = null;
>  
> if (largeFile) {
> // write tika output to disk
> outputFilePath = 
> Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
> outputFileWriter = new 
> PrintWriter(Files.newOutputStream(outputFilePath));
> handler = new BodyContentHandler(outputFileWriter);
> } else {
> // stream it in memory
> handler = new BodyContentHandler(-1);
> }
>  
> Metadata metadata = new Metadata();
> FileInputStream inputData = new 
> FileInputStream(inputFile.toFile());
> TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
> Parser autoDetectParser = new AutoDetectParser(config);
> ParseContext context = new ParseContext();
> context.set(TikaConfig.class, config);
> autoDetectParser.parse(inputData, handler, metadata, context);
>  
> String content;
> if (largeFile) {
> content = FileUtils.readFileToString(outputFilePath.toFile());
> }
> else {
> content = handler.toString();
> }
> System.out.println("content = " + content);
> }
> catch(Exception ex) {
> ex.printStackTrace();
> } finally {
> if (outputFileWriter != null) {
> outputFileWriter.close();
> }
> }
> }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4245) Tika does not get html content properly

2024-04-25 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4245:
--
Description: 
We use org.apache.tika.parser.AutoDetectParser to get the content of html 
files.  And we found out that it does not get the content of the sample file 
properly.

Following is the sample code and attached is the tika-config.xml and the sample 
html file.  The content extracted with Tika reads 
"㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
from the native file.

 

 

The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
2.9.2.   

 {code:java}
import org.apache.commons.io.FileUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractTxtFromHtml {
    private static final Path inputFile =
            new File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();

    public static void main(String[] args) {
        extractText(false);
        extractText(true);
    }

    static void extractText(boolean largeFile) {
        PrintWriter outputFileWriter = null;
        try {
            BodyContentHandler handler;
            Path outputFilePath = null;

            if (largeFile) {
                // write tika output to disk
                outputFilePath =
                        Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
                outputFileWriter = new PrintWriter(Files.newOutputStream(outputFilePath));
                handler = new BodyContentHandler(outputFileWriter);
            } else {
                // stream it in memory
                handler = new BodyContentHandler(-1);
            }

            Metadata metadata = new Metadata();
            TikaConfig config =
                    new TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
            Parser autoDetectParser = new AutoDetectParser(config);
            ParseContext context = new ParseContext();
            context.set(TikaConfig.class, config);
            // close the input stream once parsing is done
            try (FileInputStream inputData = new FileInputStream(inputFile.toFile())) {
                autoDetectParser.parse(inputData, handler, metadata, context);
            }

            String content;
            if (largeFile) {
                // flush buffered output before reading the file back
                outputFileWriter.flush();
                content = FileUtils.readFileToString(outputFilePath.toFile());
            } else {
                content = handler.toString();
            }
            System.out.println("content = " + content);
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            if (outputFileWriter != null) {
                outputFileWriter.close();
            }
        }
    }
}
{code}



[jira] [Comment Edited] (TIKA-4166) dependency updates for Tika 3.0

2024-04-22 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839745#comment-17839745
 ] 

Tilman Hausherr edited comment on TIKA-4166 at 4/22/24 3:27 PM:


It turned out to be something different than the missing package. After 
googling for the error message I found an SO answer that I had upvoted in the 
past 
https://stackoverflow.com/a/54467008/535646


was (Author: tilman):
It turned out to be something different than the missing package. After 
googling for the error message I found an SO that I had upvoted in the past 
https://stackoverflow.com/a/54467008/535646

> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
>  Issue Type: Task
>  Components: build
>    Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0

2024-04-22 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839745#comment-17839745
 ] 

Tilman Hausherr commented on TIKA-4166:
---

It turned out to be something different than the missing package. After 
googling for the error message I found an SO answer that I had upvoted in the 
past 
https://stackoverflow.com/a/54467008/535646

> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
>  Issue Type: Task
>  Components: build
>    Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: How to proceed when you are getting OSS index errors?

2024-04-22 Thread Tilman Hausherr

Hi,

We look at what the CVE is about. Some CVEs are irrelevant (see the recent 
rant from Tim), and for those we can add an exclusion in the OSS section. 
Sometimes all that is needed is to update a dependency, add it to the 
management section, or exclude it (on the assumption that the tests cover 
everything).


About this case: the repository has been updated to exclude two 
threeten versions from OSS.


Tilman

On 22.04.2024 16:16, Nicholas DiPiazza wrote:

When getting these sorts of errors:

[ERROR] Failed to execute goal
org.sonatype.ossindex.maven:ossindex-maven-plugin:3.2.0:audit
(audit-dependencies) on project tika-dl: Detected 1 vulnerable components:
[ERROR]   org.threeten:threetenbp:jar:1.3.3:provided;
https://ossindex.sonatype.org/component/pkg:maven/org.threeten/threetenbp@1.3.3?utm_source=ossindex-client_medium=integration_content=1.8.1
[ERROR] * [CVE-2024-23081] CWE-476: NULL Pointer Dereference (3.7);
https://ossindex.sonatype.org/vulnerability/CVE-2024-23081?component-type=maven=org.threeten%2Fthreetenbp_source=ossindex-client_medium=integration_content=1.8.1
[ERROR] * [CVE-2024-23082] CWE-190: Integer Overflow or Wraparound
(5.3);
https://ossindex.sonatype.org/vulnerability/CVE-2024-23082?component-type=maven=org.threeten%2Fthreetenbp_source=ossindex-client_medium=integration_content=1.8.1
[ERROR]

How do you all typically proceed? Do I patch the issue and move on somehow?
How do I get my builds to work now that this error has happened?

-Nicholas





[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0

2024-04-22 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839652#comment-17839652
 ] 

Tilman Hausherr commented on TIKA-4166:
---

The latest Apache parent update brings a javadoc plugin update, which results 
in a failure on the CI:
{noformat}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-javadoc-plugin:3.6.3:aggregate (default-cli) on 
project tika: An error has occurred in Javadoc report generation:
[ERROR] Exit code: 2
[ERROR] javadoc: error - No source files for package org.apache.tika.extractor
[ERROR] Command line was: 
/usr/local/asfpackages/java/adoptium-jdk-11.0.16.1+1/bin/javadoc @options 
@packages
{noformat}
A possible cause for this could be that in tika-batch there is a test package 
that doesn't exist as a source package. It didn't happen locally for me because 
I didn't use "javadoc:aggregate". I'll do some more tests to see whether 
renaming the test package fixes this.

> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
>  Issue Type: Task
>  Components: build
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4240) Change dependabot to weekly

2024-04-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836236#comment-17836236
 ] 

Tilman Hausherr commented on TIKA-4240:
---

I prefer daily but if more people feel pressured or annoyed by these mails (I 
never felt that way) then I accept weekly.

> Change dependabot to weekly
> ---
>
> Key: TIKA-4240
> URL: https://issues.apache.org/jira/browse/TIKA-4240
> Project: Tika
>  Issue Type: Task
>  Components: build
>Reporter: Tim Allison
>Priority: Trivial
>
> On the list, I proposed this change. Some were in favor of dropping it back 
> to monthly. [~tilman] made the argument for the benefit of seeing problems 
> quickly and also acknowledged that it is a burden to merge the daily PRs.
> I propose bumping dependabot back to weekly for a bit, and we'll see how it 
> works as a middle ground.
> If anyone feels strongly about moving back to daily, we can do that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4240) Change dependabot to weekly

2024-04-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4240:
--
Component/s: build

> Change dependabot to weekly
> ---
>
> Key: TIKA-4240
> URL: https://issues.apache.org/jira/browse/TIKA-4240
> Project: Tika
>  Issue Type: Task
>  Components: build
>Reporter: Tim Allison
>Priority: Trivial
>
> On the list, I proposed this change. Some were in favor of dropping it back 
> to monthly. [~tilman] made the argument for the benefit of seeing problems 
> quickly and also acknowledged that it is a burden to merge the daily PRs.
> I propose bumping dependabot back to weekly for a bit, and we'll see how it 
> works as a middle ground.
> If anyone feels strongly about moving back to daily, we can do that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4240) Change dependabot to weekly

2024-04-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836224#comment-17836224
 ] 

Tilman Hausherr commented on TIKA-4240:
---

Not a burden (that was Eric, sort of); I just don't have the time right now to 
fix the current build failure. I like the alerts: they are low-hanging fruit 
and also help me learn more about the code.

> Change dependabot to weekly
> ---
>
> Key: TIKA-4240
> URL: https://issues.apache.org/jira/browse/TIKA-4240
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> On the list, I proposed this change. Some were in favor of dropping it back 
> to monthly. [~tilman] made the argument for the benefit of seeing problems 
> quickly and also acknowledged that it is a burden to merge the daily PRs.
> I propose bumping dependabot back to weekly for a bit, and we'll see how it 
> works as a middle ground.
> If anyone feels strongly about moving back to daily, we can do that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Bump dependabot to weekly?

2024-04-10 Thread Tilman Hausherr
I'm fine with daily because this way we can learn ASAP if there are problems 
with new dependency versions, although I'm too busy right now.

Tilman 



-- Original Message --
From: Tim Allison 
Subject: Bump dependabot to weekly?
Date: 10.04.2024, 18:08
To:  

All,
  Tilman has been doing heroic work keeping us up to date with
dependabot's PRs. Given our pace of releases, would it make sense to
back off to weekly updates?
  Before running regression tests, we'd run the update plugin to make
sure that we're up to date.
  What do you think?

Best,

 Tim



[jira] [Commented] (TIKA-4238) replace some deprecated code

2024-04-06 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834529#comment-17834529
 ] 

Tilman Hausherr commented on TIKA-4238:
---

This was low-hanging fruit. I could also have done 
UnsynchronizedByteArrayInputStream, but replacing that one would not only 
make the code much bigger, it would also require catching an exception that 
isn't thrown now, so let's just wait and see what they do.
https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/UnsynchronizedByteArrayInputStream.Builder.html#get()

> replace some deprecated code
> 
>
> Key: TIKA-4238
> URL: https://issues.apache.org/jira/browse/TIKA-4238
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>    Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4238) replace some deprecated code

2024-04-06 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834529#comment-17834529
 ] 

Tilman Hausherr edited comment on TIKA-4238 at 4/6/24 2:12 PM:
---

This was low-hanging fruit. I could also have done 
UnsynchronizedByteArrayInputStream, but replacing that one would not only make 
the code much bigger, it would also require catching an exception that isn't 
thrown now, so let's just wait and see what they do in the future.
https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/UnsynchronizedByteArrayInputStream.Builder.html#get()


was (Author: tilman):
This was a low-hanging fruit. I could also have done 
UnsynchronizedByteArrayInputStream, but replacing that one would not only would 
make the code much bigger, it would also require to catch an exception that 
isn't thrown now, so lets just wait what they do.
https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/UnsynchronizedByteArrayInputStream.Builder.html#get()

> replace some deprecated code
> 
>
> Key: TIKA-4238
> URL: https://issues.apache.org/jira/browse/TIKA-4238
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>    Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4218:
--
Affects Version/s: 2.9.1

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.1
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Fix For: 2.9.2
>
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4218.
---
  Assignee: Tim Allison
Resolution: Fixed

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reassigned TIKA-4171:
-

Assignee: Tim Allison

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Assignee: Tim Allison
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6!
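
The reporter's hunch can be illustrated without any Tika code. A minimal sketch, assuming the form fields arrive as (key, value) pairs: a plain map keeps only the last value per key, while a map of lists keeps them all. The class, method, and key names below are hypothetical and do not reflect how Tika Server actually stores these values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: contrasts a single-valued map with a multi-valued one.
class MultiValueFields {

    // Single-valued map: each put() on an existing key replaces the old value.
    static Map<String, String> lastValueWins(String[][] pairs) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            result.put(pair[0], pair[1]);
        }
        return result;
    }

    // Multi-valued map: values for a repeated key are appended to a list.
    static Map<String, List<String>> keepAllValues(String[][] pairs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            result.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] pairs = {{"owner", "row 1"}, {"owner", "row 2"}, {"owner", "row 7"}};
        System.out.println(lastValueWins(pairs)); // only the last row survives
        System.out.println(keepAllValues(pairs)); // all rows, in insertion order
    }
}
```

Returning the list form is what the reporter asks for: a repeated key then yields every row rather than just the final one.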



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4218:
--
Fix Version/s: 2.9.2

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Fix For: 2.9.2
>
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4171.
---
Resolution: Fixed

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 2.9.2, 3.0.0-BETA
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4238) replace some deprecated code

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4238.
---
Resolution: Fixed

> replace some deprecated code
> 
>
> Key: TIKA-4238
> URL: https://issues.apache.org/jira/browse/TIKA-4238
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>    Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4239) Update to 2.9.3

2024-04-06 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4239:
-

 Summary: Update to 2.9.3
 Key: TIKA-4239
 URL: https://issues.apache.org/jira/browse/TIKA-4239
 Project: Tika
  Issue Type: Task
  Components: build
Reporter: Tilman Hausherr






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4239) Update to 2.9.3

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4239:
--
Affects Version/s: 2.9.2

> Update to 2.9.3
> ---
>
> Key: TIKA-4239
> URL: https://issues.apache.org/jira/browse/TIKA-4239
> Project: Tika
>  Issue Type: Task
>  Components: build
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4162) Update to 2.9.2

2024-04-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4162.
---
  Assignee: Tilman Hausherr
Resolution: Fixed

> Update to 2.9.2
> ---
>
> Key: TIKA-4162
> URL: https://issues.apache.org/jira/browse/TIKA-4162
> Project: Tika
>  Issue Type: Task
>  Components: build
>Affects Versions: 2.9.1
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 2.9.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4238) replace some deprecated code

2024-04-06 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4238:
-

 Summary: replace some deprecated code
 Key: TIKA-4238
 URL: https://issues.apache.org/jira/browse/TIKA-4238
 Project: Tika
  Issue Type: Task
Affects Versions: 2.9.2
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
 Fix For: 3.0.0, 2.9.3






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


2.9.2 / 2.9.3 admin

2024-04-05 Thread Tilman Hausherr
I've created the 2.9.3 version in JIRA administration. Someone (Tim?) please 
set the 2.9.2 version to released or whatever (I didn't want to touch 
that part).


Tilman



[jira] [Updated] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency

2024-04-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4236:
--
Fix Version/s: 2.9.3

> tika-parser-nlp-module has an unnecessary Guava dependency
> --
>
> Key: TIKA-4236
> URL: https://issues.apache.org/jira/browse/TIKA-4236
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2
>Reporter: Manfred Baedke
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> This should be avoided, because it's prone to maintenance and security 
> problems.
> It's easy to get rid of it: the class 
> {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses 
> {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson 
> anyway, it could just be replaced with 
> {{{}com.google.gson.reflect.TypeToken{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency

2024-04-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4236:
--
Fix Version/s: (was: 2.9.2)

> tika-parser-nlp-module has an unnecessary Guava dependency
> --
>
> Key: TIKA-4236
> URL: https://issues.apache.org/jira/browse/TIKA-4236
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2
>Reporter: Manfred Baedke
>Priority: Major
> Fix For: 3.0.0
>
>
> This should be avoided, because it's prone to maintenance and security 
> problems.
> It's easy to get rid of it: the class 
> {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses 
> {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson 
> anyway, it could just be replaced with 
> {{{}com.google.gson.reflect.TypeToken{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4236) tika-parser-nlp-module has an unnecessary Guava dependency

2024-04-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4236.
---
  Assignee: Tilman Hausherr
Resolution: Fixed

> tika-parser-nlp-module has an unnecessary Guava dependency
> --
>
> Key: TIKA-4236
> URL: https://issues.apache.org/jira/browse/TIKA-4236
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.28.5, 3.0.0-BETA, 2.9.2
>Reporter: Manfred Baedke
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> This should be avoided, because it's prone to maintenance and security 
> problems.
> It's easy to get rid of it: the class 
> {{o.a.t.parser.geo.topic.gazetteer.GeoGazetteerClient}} uses 
> {{{}com.google.common.reflect.TypeToken{}}}. Since the project uses gson 
> anyway, it could just be replaced with 
> {{{}com.google.gson.reflect.TypeToken{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

