[jira] [Assigned] (TIKA-4317) Abusive content on https://corpora.tika.apache.org/

2024-10-14 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reassigned TIKA-4317:
-

Assignee: Tim Allison

> Abusive content on https://corpora.tika.apache.org/
> ---
>
> Key: TIKA-4317
> URL: https://issues.apache.org/jira/browse/TIKA-4317
> Project: Tika
> Issue Type: Bug
> Components: site
> Reporter: Zoran Regvart
> Assignee: Tim Allison
> Priority: Major
>
> The Apache Camel team has been notified by Google of abusive content hosted 
> on https://corpora.tika.apache.org/, with the assumption that this is somehow 
> related to https://camel.apache.org. The scanning done by Google is against 
> the whole apache.org domain, so the implication is that any abusive content found 
> on any domain within apache.org will be attributed to, and will affect, other domains 
> within apache.org.
> Learn about abusive experiences here: 
> https://support.google.com/webtools/answer/7347327.
> Singled out page from Google report (content & possibly security warning):
> {code}https://corpora.tika.apache.org/base/docs/commoncrawl3/QK/QKKJTNDRIVLIPP7433IFC3EF3UVOSPIB{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-10-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888662#comment-17888662
 ] 

Tilman Hausherr commented on TIKA-4278:
---

New test result with the latest changes and the colon added to the default 
configuration; the results are the same: 
[^reports_csv_3.0.0_vs_3.0.0_new_withcolon.tar.xz]

 

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz, 
> reports_csv_3.0.0_vs_3.0.0_new_withcolon.tar.xz, 
> reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.
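
The proposal in the description above can be illustrated with a minimal sketch. This is not Tika's actual `CSVSniffer`: the class name `DelimiterSketch`, the method `guessDelimiter`, and the contents of the map are assumptions made for illustration, and the real sniffer's scoring is more involved than counting characters in the first line. The sketch only shows why a candidate set of comma and tab cannot yield a semicolon, while passing the key set of a delimiter map can.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and logic are illustrative, not Tika's implementation.
class DelimiterSketch {
    // Stand-in for CHAR_TO_STRING_DELIMITER_MAP; the entries here are assumed.
    static final Map<Character, String> CHAR_TO_STRING_DELIMITER_MAP = new LinkedHashMap<>();
    static {
        CHAR_TO_STRING_DELIMITER_MAP.put(',', "comma");
        CHAR_TO_STRING_DELIMITER_MAP.put('\t', "tab");
        CHAR_TO_STRING_DELIMITER_MAP.put('|', "pipe");
        CHAR_TO_STRING_DELIMITER_MAP.put(';', "semicolon");
        CHAR_TO_STRING_DELIMITER_MAP.put(':', "colon");
    }

    // Returns the candidate that occurs most often in the first line,
    // or null if no candidate occurs at all.
    static Character guessDelimiter(String text, Set<Character> candidates) {
        String firstLine = text.split("\\R", 2)[0];
        Character best = null;
        int bestCount = 0;
        for (char c : candidates) {
            int count = 0;
            for (int i = 0; i < firstLine.length(); i++) {
                if (firstLine.charAt(i) == c) {
                    count++;
                }
            }
            if (count > bestCount) {
                bestCount = count;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String sample = "name;age;city\nAnn;30;Oslo\n";
        // Only comma and tab are candidates, so the semicolon cannot be found.
        System.out.println(guessDelimiter(sample, Set.of(',', '\t')));
        // With the full key set, the semicolon becomes a candidate.
        System.out.println(guessDelimiter(sample, CHAR_TO_STRING_DELIMITER_MAP.keySet()));
    }
}
```

A wider candidate set also makes false positives more likely, which is presumably why the later comments discuss leaving the colon out of the default configuration.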





[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-10-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4278:
--
Attachment: reports_csv_3.0.0_vs_3.0.0_new_withcolon.tar.xz

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz, 
> reports_csv_3.0.0_vs_3.0.0_new_withcolon.tar.xz, 
> reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-10-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888651#comment-17888651
 ] 

Tilman Hausherr edited comment on TIKA-4278 at 10/11/24 1:17 PM:
-

Here's the test result: [^reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz] No 
surprises here. However the test runs only on .csv files so it misses some of 
the files mentioned in the previous report.

(This does not yet contain the latest change, and didn't include the colon)


was (Author: tilman):
Here's the test result: [^reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz] No 
surprises here. However the test runs only on .csv files so it misses some of 
the files mentioned in the previous report.

(This does not yet contain the latest change)

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz, 
> reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-10-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888651#comment-17888651
 ] 

Tilman Hausherr edited comment on TIKA-4278 at 10/11/24 1:16 PM:
-

Here's the test result: [^reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz] No 
surprises here. However the test runs only on .csv files so it misses some of 
the files mentioned in the previous report.

(This does not yet contain the latest change)


was (Author: tilman):
Here's the test result: [^reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz] No 
surprises here. However the test runs only on .csv files so it misses some of 
the files mentioned in the previous report.

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz, 
> reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-10-11 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888651#comment-17888651
 ] 

Tilman Hausherr commented on TIKA-4278:
---

Here's the test result: [^reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz] No 
surprises here. However the test runs only on .csv files so it misses some of 
the files mentioned in the previous report.

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz, 
> reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-10-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4278:
--
Attachment: reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz, 
> reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-10-10 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888326#comment-17888326
 ] 

Tilman Hausherr edited comment on TIKA-4278 at 10/10/24 3:40 PM:
-

1 and 2, i.e. set a user modifiable default configuration that doesn't contain 
the colons. I'd love to hear other opinions.


was (Author: tilman):
1 and 2, i.e. set a default configuration that doesn't contain the colons. I'd 
love to hear other opinions.

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-10-10 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888326#comment-17888326
 ] 

Tilman Hausherr commented on TIKA-4278:
---

1 and 2, i.e. set a default configuration that doesn't contain the colons. I'd 
love to hear other opinions.

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release

2024-10-10 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888321#comment-17888321
 ] 

Tilman Hausherr commented on TIKA-4280:
---

One weird thing: commoncrawl3/2P/2PSMEFJEYU7EPAZXQQDD6OL2WOQLBJRY, this is a 
compressed file. In "A" it appears as "application/json; charset=ISO-8859-1", 
in "B" as "text/csv; charset=ISO-8859-1; delimiter=colon". The file itself 
starts with "PK" so shouldn't this be easy?
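
As a rough illustration of why a leading "PK" looks like an easy signal: ZIP local file headers begin with the four bytes 0x50 0x4B 0x03 0x04 ("PK" followed by 0x03 0x04). The sketch below checks only that signature. The class and method names are hypothetical, this is not how Tika's detectors are implemented, and many container formats (for example JAR and OOXML files) share the same prefix, so a match says "ZIP-based", not "plain ZIP archive".

```java
// Hypothetical sketch; not Tika's detection code.
class ZipMagicSketch {
    // True if the buffer starts with the ZIP local file header signature "PK\u0003\u0004".
    static boolean startsWithZipLocalHeader(byte[] head) {
        return head != null
                && head.length >= 4
                && head[0] == 0x50   // 'P'
                && head[1] == 0x4B   // 'K'
                && head[2] == 0x03
                && head[3] == 0x04;
    }

    public static void main(String[] args) {
        byte[] zipLike = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        byte[] jsonLike = "{\"a\":1}".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        System.out.println(startsWithZipLocalHeader(zipLike));
        System.out.println(startsWithZipLocalHeader(jsonLike));
    }
}
```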

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!





[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release

2024-10-10 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888256#comment-17888256
 ] 

Tilman Hausherr commented on TIKA-4280:
---

csv changes => TIKA-4278

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!





build timeout fails

2024-10-04 Thread Tilman Hausherr

https://issues.apache.org/jira/browse/INFRA-26175



Re: 3.0.0 release?

2024-09-24 Thread Tilman Hausherr
This is weird...  Tika itself depends on Apache CXF, it currently uses 
4.0.5.


I couldn't believe it, but then I looked at 
https://github.com/apache/cxf/blob/main/parent/pom.xml and it's true... 
A search finds more:

https://github.com/search?q=repo%3Aapache%2Fcxf%20tika&type=code

Tilman

On 24.09.2024 15:29, Gary D. Gregory wrote:

Hi All,

Is there a time frame for 3.0.0? It looks like Apache CXF 4.1.0 depends on 
3.0.0 [1] and I'm waiting on CXF 4.1.0... Any guidance would be appreciated.

TY!
Gary
[1] https://issues.apache.org/jira/browse/CXF-8671

On 2024/08/21 17:08:22 Nicholas DiPiazza wrote:

I have a pull request for some class path loading fixes for Tika grpc.
Hoping to get that done today but it's a struggle so far

On Wed, Aug 21, 2024, 11:30 AM Tim Allison wrote:


All,

   There are a couple of items documented on
https://issues.apache.org/jira/browse/TIKA-4280 that we wanted to take care
of before the 3.0.0 release.

   I can run a comparison btwn 2.x and 3.x on our regression corpus, and I
can try to deal with javadocs.

   Any recs on how to wrap up the other issues? Are there any other blockers
not listed on that issue?

   Thank you!

Best,

 Tim





[jira] [Closed] (TIKA-2619) Memory leak: PDF meta data detection fails with OutOfMemoryError

2024-09-23 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-2619.
-
Resolution: Duplicate

> Memory leak: PDF meta data detection fails with OutOfMemoryError
> 
>
> Key: TIKA-2619
> URL: https://issues.apache.org/jira/browse/TIKA-2619
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.17
> Environment: Linux 4.13.0-37 / JDK 1.8.0_152
> Reporter: Felix Dürrwanger
> Priority: Critical
> Attachments: Bundesministerium.pdf
>
>
> When analysing the attached PDF with TIKA (embedded and server) the JVM 
> consumes all available memory and fails with an OutOfMemoryError. The PDF is 
> an official and public document from a German federal ministry.
>  
> *Client*:
> {noformat}
> fd@804F9H2:~/TIKA$ time curl -T Bundesministerium.pdf 
> http://127.0.0.1:9998/meta --header "Accept: application/json"
> Error: 500
> real 0m53.417s
> user 0m0.020s
> sys 0m0.011s
> {noformat}
>  
> *Server*:
> {noformat}
> fd@804F9H2:~/TIKA$ java -Xmx12G -jar tika-server-1.17.jar 
> Mar 29, 2018 12:25:34 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> TIFFImageWriter not loaded. tiff files will not be processed
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> Mar 29, 2018 12:25:34 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> INFO Starting Apache Tika 1.17 server
> INFO Setting the server's publish address to be http://localhost:9998/
> INFO jetty-8.y.z-SNAPSHOT
> INFO Started SelectChannelConnector@localhost:9998
> INFO Started Apache Tika server at http://localhost:9998/
> INFO meta (autodetecting type)
> WARN Application {http://resource.server.tika.apache.org/}MetadataResource 
> has thrown exception, unwinding now
> org.apache.cxf.interceptor.Fault: Java heap space
>  at 
> org.apache.cxf.service.invoker.AbstractInvoker.createFault(AbstractInvoker.java:163)
>  at 
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:129)
>  at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:202)
>  at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:101)
>  at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>  at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>  at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>  at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>  at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:274)
>  at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>  at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:76)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>  at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>  at org.eclipse.jetty.server.Server.handle(Server.java:370)
>  at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>  at 
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:973)
>  at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1035)
>  at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:647)
>  at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:231)
>  at 
> org.eclipse.jetty.server.AsyncHttpConnecti

[jira] [Updated] (TIKA-4311) Avoid potential ClassCastException in angle detection PDF extraction

2024-09-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4311:
--
Description: There is a programming error in PDFBox ExtractText now fixed 
in PDFBOX-5879, I'll fix the same code that is in PDF2XHTML here, although I 
haven't been able to reproduce the ClassCastException.  (was: There is a 
programming error in ExtractText now fixed in PDFBOX-5879, I'll fix it here 
too, although I haven't been able to reproduce the ClassCastException.)

> Avoid potential ClassCastException in angle detection PDF extraction
> 
>
> Key: TIKA-4311
> URL: https://issues.apache.org/jira/browse/TIKA-4311
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 3.0.0-BETA, 2.9.2
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> There is a programming error in PDFBox ExtractText now fixed in PDFBOX-5879, 
> I'll fix the same code that is in PDF2XHTML here, although I haven't been 
> able to reproduce the ClassCastException.





[jira] [Resolved] (TIKA-4311) Avoid potential ClassCastException in angle detection PDF extraction

2024-09-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4311.
---
Resolution: Fixed

> Avoid potential ClassCastException in angle detection PDF extraction
> 
>
> Key: TIKA-4311
> URL: https://issues.apache.org/jira/browse/TIKA-4311
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 3.0.0-BETA, 2.9.2
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> There is a programming error in PDFBox ExtractText now fixed in PDFBOX-5879, 
> I'll fix the same code that is in PDF2XHTML here, although I haven't been 
> able to reproduce the ClassCastException.





[jira] [Created] (TIKA-4311) Avoid potential ClassCastException in angle detection PDF extraction

2024-09-17 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4311:
-

Summary: Avoid potential ClassCastException in angle detection PDF extraction
Key: TIKA-4311
URL: https://issues.apache.org/jira/browse/TIKA-4311
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.9.2, 3.0.0-BETA
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Fix For: 3.0.0, 2.9.3


There is a programming error in ExtractText now fixed in PDFBOX-5879, I'll fix 
it here too, although I haven't been able to reproduce the ClassCastException.





[jira] [Resolved] (TIKA-4308) ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32

2024-09-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4308.
---
Fix Version/s: 3.0.0, 2.9.3
Resolution: Fixed

Thanks for the report!

> ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32
> ---
>
> Key: TIKA-4308
> URL: https://issues.apache.org/jira/browse/TIKA-4308
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Alexey Pelykh
> Assignee: Tilman Hausherr
> Priority: Trivial
> Labels: easyfix
> Fix For: 3.0.0, 2.9.3
>
>
> It seems that a PE executable for 64-bit platform should return 
> MACHINE_x86_64, not MACHINE_x86_32:
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-code-module/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java#L142-L144
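
The intended mapping can be sketched as follows. The two machine values come from the PE/COFF format (0x14c is IMAGE_FILE_MACHINE_I386, 0x8664 is IMAGE_FILE_MACHINE_AMD64); the class, enum, and method names are hypothetical stand-ins, not the identifiers used in `ExecutableParser`.

```java
// Hypothetical sketch of the expected behaviour; not the code in ExecutableParser.
class PeMachineSketch {
    // Illustrative stand-in for the MACHINE_x86_32 / MACHINE_x86_64 values named in the report.
    enum Arch { X86_32, X86_64, UNKNOWN }

    static final int IMAGE_FILE_MACHINE_I386 = 0x14c;
    static final int IMAGE_FILE_MACHINE_AMD64 = 0x8664;

    // Maps the PE header "Machine" field to an architecture.
    static Arch fromPeMachine(int machine) {
        switch (machine) {
            case IMAGE_FILE_MACHINE_I386:
                return Arch.X86_32;
            case IMAGE_FILE_MACHINE_AMD64:
                // The reported bug: this case also yielded the 32-bit value.
                return Arch.X86_64;
            default:
                return Arch.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromPeMachine(0x14c));
        System.out.println(fromPeMachine(0x8664));
    }
}
```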





[jira] [Updated] (TIKA-4308) ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32

2024-09-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4308:
--
Priority: Trivial  (was: Major)

> ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32
> ---
>
> Key: TIKA-4308
> URL: https://issues.apache.org/jira/browse/TIKA-4308
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Alexey Pelykh
> Assignee: Tilman Hausherr
> Priority: Trivial
> Labels: easyfix
>
> It seems that a PE executable for 64-bit platform should return 
> MACHINE_x86_64, not MACHINE_x86_32:
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-code-module/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java#L142-L144





[jira] [Updated] (TIKA-4308) ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32

2024-09-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4308:
--
Component/s: parser

> ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32
> ---
>
> Key: TIKA-4308
> URL: https://issues.apache.org/jira/browse/TIKA-4308
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Alexey Pelykh
> Assignee: Tilman Hausherr
> Priority: Major
>
> It seems that a PE executable for 64-bit platform should return 
> MACHINE_x86_64, not MACHINE_x86_32:
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-code-module/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java#L142-L144





[jira] [Updated] (TIKA-4308) ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32

2024-09-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4308:
--
Labels: easyfix  (was: )

> ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32
> ---
>
> Key: TIKA-4308
> URL: https://issues.apache.org/jira/browse/TIKA-4308
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Alexey Pelykh
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: easyfix
>
> It seems that a PE executable for 64-bit platform should return 
> MACHINE_x86_64, not MACHINE_x86_32:
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-code-module/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java#L142-L144





[jira] [Updated] (TIKA-4308) ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32

2024-09-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4308:
--
Affects Version/s: 2.9.2

> ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32
> ---
>
> Key: TIKA-4308
> URL: https://issues.apache.org/jira/browse/TIKA-4308
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Alexey Pelykh
>    Assignee: Tilman Hausherr
>Priority: Major
>
> It seems that a PE executable for a 64-bit platform should return 
> MACHINE_x86_64, not MACHINE_x86_32:
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-code-module/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java#L142-L144



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (TIKA-4308) ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32

2024-09-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reassigned TIKA-4308:
-

Assignee: Tilman Hausherr

> ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32
> ---
>
> Key: TIKA-4308
> URL: https://issues.apache.org/jira/browse/TIKA-4308
> Project: Tika
>  Issue Type: Bug
>Reporter: Alexey Pelykh
>    Assignee: Tilman Hausherr
>Priority: Major
>
> It seems that a PE executable for a 64-bit platform should return 
> MACHINE_x86_64, not MACHINE_x86_32:
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-code-module/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java#L142-L144



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3970) Certain OneNote documents produce duplicate text

2024-08-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877655#comment-17877655
 ] 

Tilman Hausherr commented on TIKA-3970:
---

Rather, the file in TIKA-4303 is missing the Chinese text that was there before 
the commit of 2023. This happens only in 2.9.2 and not in 3.0, even though both 
commits are identical.

> Certain OneNote documents produce duplicate text
> 
>
> Key: TIKA-3970
> URL: https://issues.apache.org/jira/browse/TIKA-3970
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 2.7.0
>Reporter: David Avant
>Priority: Minor
> Attachments: Screenshot 2023-02-21 at 3.43.08 PM.png, 
> lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx, lyrics.one, 
> lyrics.txt
>
>
> Extracting text from certain OneNote documents produces more text than is 
> actually in the document. In this case, the OneNote document was created 
> by opening a Word document and "printing" it to OneNote.
> To reproduce the issue, open the attached "lyrics.one" using the Tika App 
> version 2.7.0 and view the plain text. Look for the phrase "Sunday 
> Morning" and observe that there are 14 occurrences. However, in the actual 
> displayed text, it occurs only once.
> The original text in this document is only about 12K characters, but the 
> extracted text from Tika is over 300K.
>  
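
One way to quantify the duplication described above is to count how often a phrase occurs in the extracted text. This is a minimal sketch only: the class and method names are hypothetical, and the text is assumed to have been obtained already (for example, copied from the Tika App's plain text view).

```java
// Hypothetical helper for checking duplicated output; not part of Tika.
final class PhraseCountSketch {

    /** Counts non-overlapping occurrences of phrase in text. */
    static int countOccurrences(String text, String phrase) {
        if (phrase.isEmpty()) {
            throw new IllegalArgumentException("phrase must not be empty");
        }
        int count = 0;
        int from = 0;
        // indexOf returns -1 once no further match exists.
        while ((from = text.indexOf(phrase, from)) >= 0) {
            count++;
            from += phrase.length();
        }
        return count;
    }
}
```

For the attached "lyrics.one", the report implies that counting "Sunday Morning" this way would give 14 where 1 is expected.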



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4303) Unable to extract Chinese content in onenote

2024-08-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877653#comment-17877653
 ] 

Tilman Hausherr commented on TIKA-4303:
---

I tried reverting TIKA-3970 and now I get what you have. However, the commit 
diffs in both branches are absolutely identical. Very weird.

> Unable to extract Chinese content in onenote
> 
>
> Key: TIKA-4303
> URL: https://issues.apache.org/jira/browse/TIKA-4303
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.8.0, 2.9.2
>Reporter: lqangi
>Priority: Major
> Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png
>
>
> When I tried to extract the contents of a OneNote file containing Chinese using 
> Tika, the Chinese part of the file could not be extracted; only the 
> non-Chinese content was extracted.
> In addition, some of the extracted content is duplicated, as described in 
> [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970]: it seems to 
> extract historical versions of the data along with the current content. I don't 
> know whether TIKA-3970 has been fixed (I see that code has been 
> committed on GitHub, but it doesn't seem to have completely solved the 
> problem yet).
> The software versions I use are as follows:
> Tika: 2.8.0
> Onenote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761)
>  
> In order to reproduce this problem, just use the 2.8.0 version of Tika App to 
> open the attachment "Chinese-Notes.one" and check whether the Chinese content 
> in the file is extracted.
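
Whether any Chinese content survived extraction can also be checked programmatically instead of by eye. The following is a minimal sketch with hypothetical names; it assumes the extracted text is already available as a string and simply counts the code points whose Unicode script is Han.

```java
// Hypothetical helper for checking extraction results; not part of Tika.
final class HanTextSketch {

    /** Counts the code points in text whose Unicode script is Han. */
    static long countHan(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }
}
```

A result of zero for the text extracted from "Chinese-Notes.one" would correspond to the behaviour reported here.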



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4303) Unable to extract Chinese content in onenote

2024-08-29 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4303:
--
Affects Version/s: 2.9.2

> Unable to extract Chinese content in onenote
> 
>
> Key: TIKA-4303
> URL: https://issues.apache.org/jira/browse/TIKA-4303
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.8.0, 2.9.2
>Reporter: lqangi
>Priority: Major
> Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png
>
>
> When I tried to extract the contents of a OneNote file containing Chinese using 
> Tika, the Chinese part of the file could not be extracted; only the 
> non-Chinese content was extracted.
> In addition, some of the extracted content is duplicated, as described in 
> [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970]: it seems to 
> extract historical versions of the data along with the current content. I don't 
> know whether TIKA-3970 has been fixed (I see that code has been 
> committed on GitHub, but it doesn't seem to have completely solved the 
> problem yet).
> The software versions I use are as follows:
> Tika: 2.8.0
> Onenote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761)
>  
> In order to reproduce this problem, just use the 2.8.0 version of Tika App to 
> open the attachment "Chinese-Notes.one" and check whether the Chinese content 
> in the file is extracted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4303) Unable to extract Chinese content in onenote

2024-08-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877631#comment-17877631
 ] 

Tilman Hausherr commented on TIKA-4303:
---

I tried with the 3 beta and there I get more:
  
中文标题�
�
中文标题�
中文标题�
zhongwen�
中文标题�
中文标题�
中文标题�
中文标题�
�
14:08
zhongwen
zhongwen�
14:08
中文标题�
14:08
中文标题�
14:08
中文标题�
14:08
中文标题�
14:08
中文标题�
Type information into a notebook or insert information from other apps and web 
pages.
OneNote is a digital notebook that automatically saves and syncs notes as you 
work.
Follow up easily with highlights and tags.
Take handwritten notes or draw ideas.
Access the notebook from any device.
Share notebooks to collaborate with others.
14:08
中文标题�
Type information into a notebook or insert information from other apps and web 
pages.
OneNote is a digital notebook that automatically saves and syncs notes as you 
work.
Follow up easily with highlights and tags.
Take handwritten notes or draw ideas.
Access the notebook from any device.
Share notebooks to collaborate with others.
14:08
中文标题�
Type information into a notebook or insert information from other apps and web 
pages.
OneNote is a digital notebook that automatically saves and syncs notes as you 
work.
Follow up easily with highlights and tags.
Take handwritten notes or draw ideas.
Access the notebook from any device.
Share notebooks to collaborate with others.
14:08

So maybe changes were done in 3.0 but not committed to 2.9.

> Unable to extract Chinese content in onenote
> 
>
> Key: TIKA-4303
> URL: https://issues.apache.org/jira/browse/TIKA-4303
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.8.0
>Reporter: lqangi
>Priority: Major
> Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png
>
>
> When I tried to extract the contents of a OneNote file containing Chinese using 
> Tika, the Chinese part of the file could not be extracted; only the 
> non-Chinese content was extracted.
> In addition, some of the extracted content is duplicated, as described in 
> [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970]: it seems to 
> extract historical versions of the data along with the current content. I don't 
> know whether TIKA-3970 has been fixed (I see that code has been 
> committed on GitHub, but it doesn't seem to have completely solved the 
> problem yet).
> The software versions I use are as follows:
> Tika: 2.8.0
> Onenote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761)
>  
> In order to reproduce this problem, just use the 2.8.0 version of Tika App to 
> open the attachment "Chinese-Notes.one" and check whether the Chinese content 
> in the file is extracted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4303) Unable to extract Chinese content in onenote

2024-08-29 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877631#comment-17877631
 ] 

Tilman Hausherr edited comment on TIKA-4303 at 8/29/24 8:46 AM:


I tried with the 3 beta and there I get more:
  
中文标题�
�
中文标题�
中文标题�
zhongwen�
中文标题�
中文标题�
中文标题�
中文标题�
�
14:08
zhongwen
zhongwen�
14:08
中文标题�
14:08
中文标题�
14:08
中文标题�
14:08
中文标题�
14:08
中文标题�
Type information into a notebook or insert information from other apps and web 
pages.
OneNote is a digital notebook that automatically saves and syncs notes as you 
work.
Follow up easily with highlights and tags.
Take handwritten notes or draw ideas.
Access the notebook from any device.
Share notebooks to collaborate with others.
14:08
中文标题�
Type information into a notebook or insert information from other apps and web 
pages.
OneNote is a digital notebook that automatically saves and syncs notes as you 
work.
Follow up easily with highlights and tags.
Take handwritten notes or draw ideas.
Access the notebook from any device.
Share notebooks to collaborate with others.
14:08
中文标题�
Type information into a notebook or insert information from other apps and web 
pages.
OneNote is a digital notebook that automatically saves and syncs notes as you 
work.
Follow up easily with highlights and tags.
Take handwritten notes or draw ideas.
Access the notebook from any device.
Share notebooks to collaborate with others.
14:08

So maybe changes were done in 3.0 but not committed to 2.9 (where I did not get 
Chinese text).


was (Author: tilman):
I tried with the 3 beta and there I get more:
  
中文标题�
�
中文标题�
中文标题�
zhongwen�
中文标题�
中文标题�
中文标题�
中文标题�
�
14:08
zhongwen
zhongwen�
14:08
中文标题�
14:08
中文标题�
14:08
中文标题�
14:08
中文标题�
14:08
中文标题�
Type information into a notebook or insert information from other apps and web 
pages.
OneNote is a digital notebook that automatically saves and syncs notes as you 
work.
Follow up easily with highlights and tags.
Take handwritten notes or draw ideas.
Access the notebook from any device.
Share notebooks to collaborate with others.
14:08
中文标题�
Type information into a notebook or insert information from other apps and web 
pages.
OneNote is a digital notebook that automatically saves and syncs notes as you 
work.
Follow up easily with highlights and tags.
Take handwritten notes or draw ideas.
Access the notebook from any device.
Share notebooks to collaborate with others.
14:08
中文标题�
Type information into a notebook or insert information from other apps and web 
pages.
OneNote is a digital notebook that automatically saves and syncs notes as you 
work.
Follow up easily with highlights and tags.
Take handwritten notes or draw ideas.
Access the notebook from any device.
Share notebooks to collaborate with others.
14:08

So maybe changes were done in 3.0 but not committed to 2.9.

> Unable to extract Chinese content in onenote
> 
>
> Key: TIKA-4303
> URL: https://issues.apache.org/jira/browse/TIKA-4303
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.8.0
>Reporter: lqangi
>Priority: Major
> Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png
>
>
> When I tried to extract the contents of a OneNote file containing Chinese using 
> Tika, the Chinese part of the file could not be extracted; only the 
> non-Chinese content was extracted.
> In addition, some of the extracted content is duplicated, as described in 
> [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970]: it seems to 
> extract historical versions of the data along with the current content. I don't 
> know whether TIKA-3970 has been fixed (I see that code has been 
> committed on GitHub, but it doesn't seem to have completely solved the 
> problem yet).
> The software versions I use are as follows:
> Tika: 2.8.0
> Onenote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761)
>  
> In order to reproduce this problem, just use the 2.8.0 version of Tika App to 
> open the attachment "Chinese-Notes.one" and check whether the Chinese content 
> in the file is extracted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (TIKA-4302) Please generate a new 2.9.x deployment

2024-08-28 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-4302.
-
Resolution: Duplicate

> Please generate a new 2.9.x deployment
> --
>
> Key: TIKA-4302
> URL: https://issues.apache.org/jira/browse/TIKA-4302
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.2
>Reporter: Alan Klein
>Priority: Major
>
> It appears that a number of dependencies were updated in TIKA-4166.
> Would you be able to generate a new 2.9.x deployment that includes the 
> changes in TIKA-4166? I am specifically looking to address CVE-2024-29857 
> (High) which is due to Bouncy Castle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4302) Please generate a new 2.9.x deployment

2024-08-28 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4302:
--
Fix Version/s: (was: TIKA-4239)

> Please generate a new 2.9.x deployment
> --
>
> Key: TIKA-4302
> URL: https://issues.apache.org/jira/browse/TIKA-4302
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.2
>Reporter: Alan Klein
>Priority: Major
>
> It appears that a number of dependencies were updated in TIKA-4166.
> Would you be able to generate a new 2.9.x deployment that includes the 
> changes in TIKA-4166? I am specifically looking to address CVE-2024-29857 
> (High) which is due to Bouncy Castle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (TIKA-4302) Please generate a new 2.9.x deployment

2024-08-28 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reopened TIKA-4302:
---

> Please generate a new 2.9.x deployment
> --
>
> Key: TIKA-4302
> URL: https://issues.apache.org/jira/browse/TIKA-4302
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.2
>Reporter: Alan Klein
>Priority: Major
> Fix For: TIKA-4239
>
>
> It appears that a number of dependencies were updated in TIKA-4166.
> Would you be able to generate a new 2.9.x deployment that includes the 
> changes in TIKA-4166? I am specifically looking to address CVE-2024-29857 
> (High) which is due to Bouncy Castle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (TIKA-4302) Please generate a new 2.9.x deployment

2024-08-28 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-4302.
-
Fix Version/s: TIKA-4239
   Resolution: Duplicate

I don't know the date, but there will definitely be another 2.x release. I'm 
closing this issue as a duplicate of TIKA-4239. See also the Tika homepage for 
how we handled end of life for 1.x: there were several 1.x releases after 2.x 
had been released.

> Please generate a new 2.9.x deployment
> --
>
> Key: TIKA-4302
> URL: https://issues.apache.org/jira/browse/TIKA-4302
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.2
>Reporter: Alan Klein
>Priority: Major
> Fix For: TIKA-4239
>
>
> It appears that a number of dependencies were updated in TIKA-4166.
> Would you be able to generate a new 2.9.x deployment that includes the 
> changes in TIKA-4166? I am specifically looking to address CVE-2024-29857 
> (High) which is due to Bouncy Castle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4239) Update to 2.9.3

2024-08-28 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877407#comment-17877407
 ] 

Tilman Hausherr commented on TIKA-4239:
---

I've modified the tika-branch2x-jdk11 build job so that it creates a JIRA 
comment, as is already done for trunk.

> Update to 2.9.3
> ---
>
> Key: TIKA-4239
> URL: https://issues.apache.org/jira/browse/TIKA-4239
> Project: Tika
>  Issue Type: Task
>  Components: build
>Affects Versions: 2.9.2
>Reporter: Tilman Hausherr
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4302) Please generate a new 2.9.x deployment

2024-08-28 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877298#comment-17877298
 ] 

Tilman Hausherr commented on TIKA-4302:
---

I looked at CVE-2024-29857. The "worst" that could happen is high CPU load.

> Please generate a new 2.9.x deployment
> --
>
> Key: TIKA-4302
> URL: https://issues.apache.org/jira/browse/TIKA-4302
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.2
>Reporter: Alan Klein
>Priority: Major
>
> It appears that a number of dependencies were updated in TIKA-4166.
> Would you be able to generate a new 2.9.x deployment that includes the 
> changes in TIKA-4166? I am specifically looking to address CVE-2024-29857 
> (High) which is due to Bouncy Castle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4302) Please generate a new 2.9.x deployment

2024-08-27 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877242#comment-17877242
 ] 

Tilman Hausherr commented on TIKA-4302:
---

Snapshots are here:
https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/2.9.3-SNAPSHOT/

> Please generate a new 2.9.x deployment
> --
>
> Key: TIKA-4302
> URL: https://issues.apache.org/jira/browse/TIKA-4302
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.2
>Reporter: Alan Klein
>Priority: Major
>
> It appears that a number of dependencies were updated in TIKA-4166.
> Would you be able to generate a new 2.9.x deployment that includes the 
> changes in TIKA-4166? I am specifically looking to address CVE-2024-29857 
> (High) which is due to Bouncy Castle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-08-26 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-4231.
-
Resolution: Duplicate

Closing as duplicate. You can still comment.

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18 and the Maven dependency 
> tika-parsers-standard-package 
> ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0])
>  
>Reporter: Aamir
>Priority: Major
>  Labels: ActualText
> Attachments: TIKA-4231-arabic-new.txt, arabic-pdfbox.txt, arabic.pdf, 
> arabic.txt
>
>
> Attached is a PDF with Arabic text in it. 
> When parsed using Tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4280) Tasks for the 3.0.0 release

2024-08-24 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876410#comment-17876410
 ] 

Tilman Hausherr edited comment on TIKA-4280 at 8/24/24 9:52 AM:


re hdf5 I've created a ticket

[https://github.com/bytedeco/javacpp-presets/issues/1533]

Before that I contacted the hdf5 people but from their answer and from 
[https://bytedeco.org|https://bytedeco.org/] I think that they're not the ones 
responsible.


was (Author: tilman):
re hdf5 I've created a ticket

[https://github.com/bytedeco/javacpp-presets/issues/1533]

Before that I conntacted the hdf5 people but from their answer and from 
https://bytedeco.org I think that they're not the ones responsible.

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release

2024-08-24 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876410#comment-17876410
 ] 

Tilman Hausherr commented on TIKA-4280:
---

re hdf5 I've created a ticket

[https://github.com/bytedeco/javacpp-presets/issues/1533]

Before that I contacted the hdf5 people but from their answer and from 
https://bytedeco.org I think that they're not the ones responsible.

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release

2024-08-21 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875620#comment-17875620
 ] 

Tilman Hausherr commented on TIKA-4280:
---

I just reverted the collections "-M" version. I think it was me who set it, just 
so that project gets more testing. Re tika-dl, let's keep it as it is; it has 
been using non-regular releases since 2018 (TIKA-2672).

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4280) Tasks for the 3.0.0 release

2024-08-21 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875620#comment-17875620
 ] 

Tilman Hausherr edited comment on TIKA-4280 at 8/21/24 7:04 PM:


I just reverted the collections "-M" version. I think it was me who set it, just 
so that this other project gets more testing. Re tika-dl, let's keep it as it is; 
it has been using non-regular releases since 2018 (TIKA-2672).


was (Author: tilman):
I just reverted the collections "-M" version, I think it was me who set it just 
so that project gets more tests. Re tika-dl lets keep it as it is, it has been 
with non regular releases since 2018 (TIKA-2672).

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release

2024-08-21 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875592#comment-17875592
 ] 

Tilman Hausherr commented on TIKA-4280:
---

TIKA-4290 is resolved. He's of course free to bring up more changes, but he has 
kept quiet for some time now.

Re the ffmpeg issue and the hdf5 issue: for hdf5, 1.14.3-1.5.10 is the latest 
version on Maven Central, but it has a CVE. They claim it has been fixed in 1.14.4

[https://www.hdfgroup.org/2024/05/06/new-hdf5-cve-issues-fixed-in-1-14-4/]

but that version isn't available. ffmpeg also has a CVE; I've excluded it 
completely, see my comment in tika-parsers/tika-parsers-ml/tika-dl/pom.xml. 
At this time it is still at the vulnerable 6.1.1-1.5.10. Do we have a 
"stakeholder" on these two issues who can help?

> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3858) Ligatures convert on text extraction

2024-08-17 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874491#comment-17874491
 ] 

Tilman Hausherr commented on TIKA-3858:
---

Fixed in PDFBOX-5868.

>  Ligatures convert on text extraction
> -
>
> Key: TIKA-3858
> URL: https://issues.apache.org/jira/browse/TIKA-3858
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4.1
> Environment: win 8, jre 1.5
>Reporter: tom hill
>Priority: Major
>  Labels: ActualText
> Attachments: TikaChromeInboxLigature.pdf
>
>
> It appears that the issue in TIKA-1289 is still present. Ligatures get 
> replaced by a question mark.
> As a particular example, the ft ligature is getting replaced by the UTF-8 
> bytes ef bf bd
> Is there any new resolution on this issue? Just returning the fl ligature 
> would be great, or normalizing it to f, t.
> This particular example comes from saving my gmail inbox page as a pdf, in 
> chrome. It uses the ft ligature in the word "Drafts".
> There are many similar examples, it's not specific to one pdf generator. 
> I'm using tika-app-2.4.1.jar 
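
As background to the bytes quoted above: ef bf bd is the UTF-8 encoding of U+FFFD, the Unicode replacement character, which is emitted when a glyph cannot be mapped to a character. For ligatures that do have their own code point, such as U+FB02 (fl), NFKC normalization expands them into plain letters. The sketch below uses hypothetical class and method names and is not the fix made in PDFBOX-5868; it cannot help when the extractor has already produced U+FFFD.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper illustrating ligature handling; not part of Tika or PDFBox.
final class LigatureSketch {

    /** Expands compatibility characters such as U+FB02 (fl ligature) into plain letters. */
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    /** Decodes raw bytes as UTF-8, e.g. to inspect what a byte sequence represents. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Normalizing after extraction is a post-processing choice on the caller's side; whether a ligature arrives as a real code point or as U+FFFD depends on the Unicode mapping information in the PDF.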



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-08-17 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874464#comment-17874464
 ] 

Tilman Hausherr edited comment on TIKA-4231 at 8/17/24 9:21 AM:


Here's a new text extraction after fixing PDFBOX-5868:  
[^TIKA-4231-arabic-new.txt]. Does this look closer to what you're expecting?


was (Author: tilman):
Here's a new text extraction:  [^TIKA-4231-arabic-new.txt] does this look 
closer to what you're expecting?

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18 and the Maven dependency 
> tika-parsers-standard-package 
> ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0])
>  
>Reporter: Aamir
>Priority: Major
> Attachments: TIKA-4231-arabic-new.txt, arabic-pdfbox.txt, arabic.pdf, 
> arabic.txt
>
>
> Attached is a PDF with Arabic text in it. 
> When parsed using Tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 





[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-08-17 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874464#comment-17874464
 ] 

Tilman Hausherr commented on TIKA-4231:
---

Here's a new text extraction:  [^TIKA-4231-arabic-new.txt] does this look 
closer to what you're expecting?

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18. And using maven dependency 
> tika-parsers-standard-package 
> ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)]
>  
>Reporter: Aamir
>Priority: Major
> Attachments: TIKA-4231-arabic-new.txt, arabic-pdfbox.txt, arabic.pdf, 
> arabic.txt
>
>
> Attached is a PDF with Arabic text in it. 
> When parsed using Tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 





[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-08-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4231:
--
Attachment: TIKA-4231-arabic-new.txt

> Parsing Arabic PDF is returning bad data
> 
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18. And using maven dependency 
> tika-parsers-standard-package 
> ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)]
>  
>Reporter: Aamir
>Priority: Major
> Attachments: TIKA-4231-arabic-new.txt, arabic-pdfbox.txt, arabic.pdf, 
> arabic.txt
>
>
> Attached is a PDF with Arabic text in it. 
> When parsed using Tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text doc is also attached which contains the parsed text. 
> Most of the other Arabic PDFs parse fine, but this one is giving this output. 





[jira] [Commented] (TIKA-4298) Failed to detect charset for zip entry with short non-Unicode file name

2024-08-15 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873949#comment-17873949
 ] 

Tilman Hausherr commented on TIKA-4298:
---

The problem is that this image might be considered to be a work of art. Your 
colleague didn't sign an ICLA. IMHO there might be two solutions: 1) you 
recreate the zip file without the image 2) you change the test so that it loads 
the zip file from the URL in the ticket. (2) is done a lot in PDFBox but I 
haven't seen it in tika.

> Failed to detect charset for zip entry with short non-Unicode file name
> ---
>
> Key: TIKA-4298
> URL: https://issues.apache.org/jira/browse/TIKA-4298
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Reporter: Mingchun Zhao
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
> Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip
>
>
> The Japanese file names extracted from a zip file  
> [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The charset of the file 
> name is Shift_JIS, but the detect() method within the PackageParser class was 
> not able to detect the charset properly.
> {code:java}
> $ ls -1 testZipEntryNameCharsetShiftSJIS
> shiba.png
> 文章1.txt
> 文章2.txt
> {code}
> {code:java}
> $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
>  xmlns="http://www.w3.org/1999/xhtml";>
> 
> 
>  content="org.apache.tika.parser.pkg.PackageParser"/>
> 
> 
> 
> 
> 
> 
> 
> 
> shiba.png
> 
> 
> ���1.txt
> あいうえお
> かきくけこ
> 
> 
> ���2.txt
> さしすせそ
> たちつてと
> 
> % {code}
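
The garbling in the output above can be reproduced without Tika. The sketch below is an illustration only and does not show how PackageParser detects charsets. It encodes the first file name from the listing as Shift_JIS and then decodes the same bytes once with the right charset and once as UTF-8. The class name is hypothetical, and it assumes the JDK in use ships the Shift_JIS charset (standard JDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ZipNameCharsetDemo {
    // First .txt entry name from the listing, written with Unicode escapes
    // so that the source file encoding does not matter.
    static final String NAME = "\u6587\u7AE0" + "1.txt";

    // The bytes a Shift_JIS zip tool would store for the entry name.
    static byte[] rawName() {
        return NAME.getBytes(Charset.forName("Shift_JIS"));
    }

    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        byte[] raw = rawName();
        // Decoding with the correct charset restores the name.
        System.out.println(decode(raw, Charset.forName("Shift_JIS")).equals(NAME));
        // Decoding as UTF-8 yields U+FFFD replacement characters instead.
        System.out.println(decode(raw, StandardCharsets.UTF_8).contains("\uFFFD"));
    }
}
```

With only two non-ASCII characters in the name there are very few bytes for a statistical detector to work with, which fits the "short file name" wording in the issue title.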





[jira] [Commented] (TIKA-4298) Failed to detect charset for zip entry with short non-Unicode file name

2024-08-15 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873921#comment-17873921
 ] 

Tilman Hausherr commented on TIKA-4298:
---

I already tested it locally - nice. But what's with the ZIP file? Is this from 
the wild, or did you create it yourself? Who has the copyright of the shiba.png 
image?

> Failed to detect charset for zip entry with short non-Unicode file name
> ---
>
> Key: TIKA-4298
> URL: https://issues.apache.org/jira/browse/TIKA-4298
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Reporter: Mingchun Zhao
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
> Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip
>
>
> The Japanese file names extracted from a zip file  
> [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The charset of the file 
> name is Shift_JIS, but the detect() method within the PackageParser class was 
> not able to detect the charset properly.
> {code:java}
> $ ls -1 testZipEntryNameCharsetShiftSJIS
> shiba.png
> 文章1.txt
> 文章2.txt
> {code}
> {code:java}
> $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
>  xmlns="http://www.w3.org/1999/xhtml";>
> 
> 
>  content="org.apache.tika.parser.pkg.PackageParser"/>
> 
> 
> 
> 
> 
> 
> 
> 
> shiba.png
> 
> 
> ���1.txt
> あいうえお
> かきくけこ
> 
> 
> ���2.txt
> さしすせそ
> たちつてと
> 
> % {code}





[jira] [Updated] (TIKA-4298) Failed to detect charset for zip entry with short non-Unicode file name

2024-08-15 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4298:
--
Fix Version/s: 3.0.0
   2.9.3

> Failed to detect charset for zip entry with short non-Unicode file name
> ---
>
> Key: TIKA-4298
> URL: https://issues.apache.org/jira/browse/TIKA-4298
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Reporter: Mingchun Zhao
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
> Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip
>
>
> The Japanese file names extracted from a zip file  
> [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The charset of the file 
> name is Shift_JIS, but the detect() method within the PackageParser class was 
> not able to detect the charset properly.
> {code:java}
> $ ls -1 testZipEntryNameCharsetShiftSJIS
> shiba.png
> 文章1.txt
> 文章2.txt
> {code}
> {code:java}
> $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
>  xmlns="http://www.w3.org/1999/xhtml";>
> 
> 
>  content="org.apache.tika.parser.pkg.PackageParser"/>
> 
> 
> 
> 
> 
> 
> 
> 
> shiba.png
> 
> 
> ���1.txt
> あいうえお
> かきくけこ
> 
> 
> ���2.txt
> さしすせそ
> たちつてと
> 
> % {code}





[jira] [Resolved] (TIKA-4290) Fix code inspection anomalies

2024-08-09 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4290.
---
Resolution: Fixed

> Fix code inspection anomalies
> -
>
> Key: TIKA-4290
> URL: https://issues.apache.org/jira/browse/TIKA-4290
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>






[jira] [Resolved] (TIKA-4296) "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32

2024-08-09 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4296.
---
Resolution: Fixed

> "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32
> -
>
> Key: TIKA-4296
> URL: https://issues.apache.org/jira/browse/TIKA-4296
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>Reporter: Thomas Mortagne
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
> Attachments: pdf.pdf
>
>
> I just upgraded my pdfbox dependency to 2.0.32 and any Tika#parseToString of 
> a pdf file seems to produce the following warning:
> {noformat}
> WARN  o.apache.pdfbox.text.PDFTextStripper - Parameter must be 1-based, but 
> is -1
> {noformat}
> The behavior is the same as with 2.0.31, it's just that pdfbox is apparently 
> not too happy anymore with the way it's used by Tika.
> This new warning was apparently introduced by PDFBOX-5822.
> Just in case it's not actually any file, here is one with which I reproduce:  
> [^pdf.pdf] 





[jira] [Commented] (TIKA-4296) "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32

2024-08-07 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871740#comment-17871740
 ] 

Tilman Hausherr commented on TIKA-4296:
---

I'll have to wait until PDFBox 3.0.3 is released (very soon).

> "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32
> -
>
> Key: TIKA-4296
> URL: https://issues.apache.org/jira/browse/TIKA-4296
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>Reporter: Thomas Mortagne
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
> Attachments: pdf.pdf
>
>
> I just upgraded my pdfbox dependency to 2.0.32 and any Tika#parseToString of 
> a pdf file seems to produce the following warning:
> {noformat}
> WARN  o.apache.pdfbox.text.PDFTextStripper - Parameter must be 1-based, but 
> is -1
> {noformat}
> The behavior is the same as with 2.0.31, it's just that pdfbox is apparently 
> not too happy anymore with the way it's used by Tika.
> This new warning was apparently introduced by PDFBOX-5822.
> Just in case it's not actually any file, here is one with which I reproduce:  
> [^pdf.pdf] 





[jira] [Reopened] (TIKA-4296) "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32

2024-08-07 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reopened TIKA-4296:
---

I reverted the trunk to investigate the test failures.

> "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32
> -
>
> Key: TIKA-4296
> URL: https://issues.apache.org/jira/browse/TIKA-4296
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>Reporter: Thomas Mortagne
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
> Attachments: pdf.pdf
>
>
> I just upgraded my pdfbox dependency to 2.0.32 and any Tika#parseToString of 
> a pdf file seems to produce the following warning:
> {noformat}
> WARN  o.apache.pdfbox.text.PDFTextStripper - Parameter must be 1-based, but 
> is -1
> {noformat}
> The behavior is the same as with 2.0.31, it's just that pdfbox is apparently 
> not too happy anymore with the way it's used by Tika.
> This new warning was apparently introduced by PDFBOX-5822.
> Just in case it's not actually any file, here is one with which I reproduce:  
> [^pdf.pdf] 





[jira] [Resolved] (TIKA-4296) "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32

2024-08-07 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4296.
---
Resolution: Fixed

Thanks, this will be fixed in the next version. It's really just a warning.

> "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32
> -
>
> Key: TIKA-4296
> URL: https://issues.apache.org/jira/browse/TIKA-4296
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>Reporter: Thomas Mortagne
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
> Attachments: pdf.pdf
>
>
> I just upgraded my pdfbox dependency to 2.0.32 and any Tika#parseToString of 
> a pdf file seems to produce the following warning:
> {noformat}
> WARN  o.apache.pdfbox.text.PDFTextStripper - Parameter must be 1-based, but 
> is -1
> {noformat}
> The behavior is the same as with 2.0.31, it's just that pdfbox is apparently 
> not too happy anymore with the way it's used by Tika.
> This new warning was apparently introduced by PDFBOX-5822.
> Just in case it's not actually any file, here is one with which I reproduce:  
> [^pdf.pdf] 





[jira] [Assigned] (TIKA-4296) "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32

2024-08-07 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reassigned TIKA-4296:
-

Assignee: Tilman Hausherr

> "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32
> -
>
> Key: TIKA-4296
> URL: https://issues.apache.org/jira/browse/TIKA-4296
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>Reporter: Thomas Mortagne
>Assignee: Tilman Hausherr
>Priority: Major
> Attachments: pdf.pdf
>
>
> I just upgraded my pdfbox dependency to 2.0.32 and any Tika#parseToString of 
> a pdf file seems to produce the following warning:
> {noformat}
> WARN  o.apache.pdfbox.text.PDFTextStripper - Parameter must be 1-based, but 
> is -1
> {noformat}
> The behavior is the same as with 2.0.31, it's just that pdfbox is apparently 
> not too happy anymore with the way it's used by Tika.
> This new warning was apparently introduced by PDFBOX-5822.
> Just in case it's not actually any file, here is one with which I reproduce:  
> [^pdf.pdf] 





[jira] [Updated] (TIKA-4296) "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32

2024-08-07 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4296:
--
Fix Version/s: 3.0.0
   2.9.3

> "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32
> -
>
> Key: TIKA-4296
> URL: https://issues.apache.org/jira/browse/TIKA-4296
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.2
>Reporter: Thomas Mortagne
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
> Attachments: pdf.pdf
>
>
> I just upgraded my pdfbox dependency to 2.0.32 and any Tika#parseToString of 
> a pdf file seems to produce the following warning:
> {noformat}
> WARN  o.apache.pdfbox.text.PDFTextStripper - Parameter must be 1-based, but 
> is -1
> {noformat}
> The behavior is the same as with 2.0.31, it's just that pdfbox is apparently 
> not too happy anymore with the way it's used by Tika.
> This new warning was apparently introduced by PDFBOX-5822.
> Just in case it's not actually any file, here is one with which I reproduce:  
> [^pdf.pdf] 





[jira] [Updated] (TIKA-4247) HttpFetcher - add ability to send request headers

2024-08-07 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4247:
--
Fix Version/s: 3.0.0

> HttpFetcher - add ability to send request headers
> -
>
> Key: TIKA-4247
> URL: https://issues.apache.org/jira/browse/TIKA-4247
> Project: Tika
>  Issue Type: New Feature
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> add ability to send request headers





[jira] [Resolved] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-08-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4252.
---
Resolution: Fixed

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> When calling:
> PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(
>         request.getFetchKey(),
>         new FetchKey(fetcher.getName(), request.getFetchKey()),
>         new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>         FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
> UnsynchronizedByteArrayOutputStream bos =
>         UnsynchronizedByteArrayOutputStream.builder().get();
> try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>     objectOutputStream.writeObject(t);
> }
> byte[] bytes = bos.toByteArray();
> output.write(CALL.getByte());
> output.writeInt(bytes.length);
> output.write(bytes);
> output.flush();
>  
> I verified the bytes have the expected metadata from that point.
>  
> UPDATE: found issue
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if fetch tuple 
> metadata is null.
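
The fix described in the update amounts to a null fallback. Below is a minimal sketch of that pattern. It is not the actual PipesServer code: a plain Map stands in for Tika's Metadata class, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class MetadataFallbackDemo {
    // Keep the metadata that arrived with the fetch tuple and only fall back
    // to a fresh, empty instance when the tuple carries none.
    // Map<String, String> is a stand-in for Tika's Metadata type.
    static Map<String, String> metadataFor(Map<String, String> tupleMetadata) {
        return tupleMetadata != null ? tupleMetadata : new HashMap<>();
    }

    public static void main(String[] args) {
        Map<String, String> incoming = new HashMap<>();
        incoming.put("Authorization", "example-value");

        // The caller's metadata survives.
        System.out.println(metadataFor(incoming));
        // A missing metadata object is replaced by an empty one.
        System.out.println(metadataFor(null));
    }
}
```

Always creating a new object, as the report describes, would correspond to ignoring the argument and returning the empty map in both cases.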





[jira] [Comment Edited] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-05 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871132#comment-17871132
 ] 

Tilman Hausherr edited comment on TIKA-4294 at 8/5/24 5:36 PM:
---

What I mean is that if its name is equal to the superclass, then the result is 
the superclass. Else if not equal, then the class is created from the 
superclass name and again, the result is the superclass. 
{code}
Class superClazz = Class.forName(superClassName);
{code}
would be the same.

After writing this I googled... it seems that yes, it does take time, so your 
code should stay as it is

https://stackoverflow.com/questions/18231991/class-forname-caching
https://stackoverflow.com/questions/25967441/difference-between-calling-a-class-constructor-and-using-class-forname-newinst


was (Author: tilman):
What I mean is that if its name is equal to the superclass, then the result is 
the superclass. Else if not equal, then the class is created from the 
superclass name and again, the result is the superclass. 
{code}
Class superClazz = Class.forName(superClassName);
{code}
would be the same.

After writing this I googled... seems that yes it does take time, then your 
code should stay that way

https://stackoverflow.com/questions/18231991/class-forname-caching
https://stackoverflow.com/questions/25967441/difference-between-calling-a-class-constructor-and-using-class-forname-newinst

> Simplify serialization/deserialization of ParseContext
> --
>
> Key: TIKA-4294
> URL: https://issues.apache.org/jira/browse/TIKA-4294
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
> serialization and deserialization of ParseContext to avoid redundancy of the 
> superclass.





[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-05 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871132#comment-17871132
 ] 

Tilman Hausherr commented on TIKA-4294:
---

What I mean is that if its name is equal to the superclass, then the result is 
the superclass. Else if not equal, then the class is created from the 
superclass name and again, the result is the superclass. 
{code}
Class superClazz = Class.forName(superClassName);
{code}
would be the same.

After writing this I googled... seems that yes it does take time, then your 
code should stay that way

https://stackoverflow.com/questions/18231991/class-forname-caching
https://stackoverflow.com/questions/25967441/difference-between-calling-a-class-constructor-and-using-class-forname-newinst

> Simplify serialization/deserialization of ParseContext
> --
>
> Key: TIKA-4294
> URL: https://issues.apache.org/jira/browse/TIKA-4294
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
> serialization and deserialization of ParseContext to avoid redundancy of the 
> superclass.





[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext

2024-08-05 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871127#comment-17871127
 ] 

Tilman Hausherr commented on TIKA-4294:
---

That alternative is still there:
{code}
Class superClazz = className.equals(superClassName) ? clazz : 
Class.forName(superClassName);
{code}
The result will always be the {{superClassName}} class. It only makes sense if 
you'd assume that {{Class.forName}} is a very slow operation (I don't know if 
it is), or could fail for security reasons. Or I'm missing something here.
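
To make the point concrete, here is a small self-contained check. The class and method names are invented for the example, and it assumes everything is loaded by the same class loader. It compares the conditional form with a plain Class.forName call for both branches.

```java
import java.util.ArrayList;

class SuperClassLookupDemo {
    // The conditional form quoted above: reuse clazz when the names match.
    static Class<?> withShortcut(Class<?> clazz, String superClassName) {
        try {
            return clazz.getName().equals(superClassName)
                    ? clazz
                    : Class.forName(superClassName);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // The simplified form: always look the class up by name.
    static Class<?> direct(String superClassName) {
        try {
            return Class.forName(superClassName);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Names equal: the shortcut returns clazz itself.
        System.out.println(withShortcut(ArrayList.class, "java.util.ArrayList")
                == direct("java.util.ArrayList"));
        // Names differ: both forms go through Class.forName.
        System.out.println(withShortcut(ArrayList.class, "java.util.List")
                == direct("java.util.List"));
    }
}
```

Both comparisons are expected to hold, so the only difference between the two forms is whether the lookup is skipped when the names already match.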

> Simplify serialization/deserialization of ParseContext
> --
>
> Key: TIKA-4294
> URL: https://issues.apache.org/jira/browse/TIKA-4294
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the 
> serialization and deserialization of ParseContext to avoid redundancy of the 
> superclass.





[jira] [Resolved] (TIKA-4291) In JDBCEmitter local var dateFormats shadows class field with the same name

2024-08-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4291.
---
  Assignee: Tilman Hausherr
Resolution: Fixed

Thanks!

> In JDBCEmitter local var dateFormats shadows class field with the same name
> ---
>
> Key: TIKA-4291
> URL: https://issues.apache.org/jira/browse/TIKA-4291
> Project: Tika
>  Issue Type: Bug
>  Components: tika-pipes
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: Dmitrii Kriukov
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> Line 338 of  JDBCEmitter
> Local variable dateFormats is created, populated with values, but never used 
> in its scope.
> It's not clear how to fix. Was it planned to use class field with the same 
> type and name?
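
For readers unfamiliar with this kind of inspection warning, the sketch below shows the general pattern. It is a made-up class, not the JDBCEmitter source: a local variable with the same name as a field is populated, while the field itself stays untouched.

```java
import java.util.ArrayList;
import java.util.List;

class ShadowingDemo {
    private final List<String> dateFormats = new ArrayList<>();

    // The local variable shadows the field, so the field stays empty and
    // the local list is discarded when the method returns.
    void buggyInit() {
        List<String> dateFormats = new ArrayList<>();
        dateFormats.add("yyyy-MM-dd");
    }

    // Dropping the local declaration makes the code populate the field.
    void fixedInit() {
        this.dateFormats.add("yyyy-MM-dd");
    }

    int fieldSize() {
        return dateFormats.size();
    }

    public static void main(String[] args) {
        ShadowingDemo demo = new ShadowingDemo();
        demo.buggyInit();
        System.out.println(demo.fieldSize()); // the field was never touched
        demo.fixedInit();
        System.out.println(demo.fieldSize()); // now the field has one entry
    }
}
```

Whether the right fix is to use the field or simply delete the dead local code depends on what the surrounding method was meant to do.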





[jira] [Updated] (TIKA-4291) In JDBCEmitter local var dateFormats shadows class field with the same name

2024-08-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4291:
--
Fix Version/s: 3.0.0
   2.9.3

> In JDBCEmitter local var dateFormats shadows class field with the same name
> ---
>
> Key: TIKA-4291
> URL: https://issues.apache.org/jira/browse/TIKA-4291
> Project: Tika
>  Issue Type: Bug
>  Components: tika-pipes
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: Dmitrii Kriukov
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> Line 338 of  JDBCEmitter
> Local variable dateFormats is created, populated with values, but never used 
> in its scope.
> It's not clear how to fix. Was it planned to use class field with the same 
> type and name?





[jira] [Updated] (TIKA-4291) In JDBCEmitter local var dateFormats shadows class field with the same name

2024-08-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4291:
--
Affects Version/s: 2.9.2
   3.0.0-BETA

> In JDBCEmitter local var dateFormats shadows class field with the same name
> ---
>
> Key: TIKA-4291
> URL: https://issues.apache.org/jira/browse/TIKA-4291
> Project: Tika
>  Issue Type: Bug
>  Components: tika-pipes
>Affects Versions: 3.0.0-BETA, 2.9.2
>Reporter: Dmitrii Kriukov
>Priority: Major
>
> Line 338 of  JDBCEmitter
> Local variable dateFormats is created, populated with values, but never used 
> in its scope.
> It's not clear how to fix. Was it planned to use class field with the same 
> type and name?





[jira] [Commented] (TIKA-4291) In JDBCEmitter local var dateFormats shadows class field with the same name

2024-08-05 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871083#comment-17871083
 ] 

Tilman Hausherr commented on TIKA-4291:
---

This was done in TIKA-3916 / TIKA-3930. I think that code was moved up into 
the constructor and the old copy was never deleted. Also, the left index 
was always the same.
ping [~tallison]

> In JDBCEmitter local var dateFormats shadows class field with the same name
> ---
>
> Key: TIKA-4291
> URL: https://issues.apache.org/jira/browse/TIKA-4291
> Project: Tika
>  Issue Type: Bug
>  Components: tika-pipes
>Reporter: Dmitrii Kriukov
>Priority: Major
>
> Line 338 of  JDBCEmitter
> Local variable dateFormats is created, populated with values, but never used 
> in its scope.
> It's not clear how to fix. Was it planned to use class field with the same 
> type and name?





[jira] [Resolved] (TIKA-4292) Mismatched type in contains() calls in OneNoteTreeWalker

2024-08-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4292.
---
Fix Version/s: 3.0.0
 Assignee: Tilman Hausherr
   Resolution: Fixed

Thank you, fixed.

> Mismatched type in contains() calls in OneNoteTreeWalker
> 
>
> Key: TIKA-4292
> URL: https://issues.apache.org/jira/browse/TIKA-4292
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 3.0.0-BETA
>Reporter: Dmitrii Kriukov
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0
>
>
> lines 472 499 - Set can't contain instances of 
> OneNotePropertyId
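
The generic type of the Set is missing from the text above, so the sketch below uses Long and Integer purely as stand-ins. It shows why a contains() call with a mismatched argument type compiles (the parameter is Object) yet can never return true. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class ContainsTypeMismatchDemo {
    // The int argument is boxed to Integer, which never equals a Long,
    // so this lookup is always false.
    static boolean lookupWithWrongType(Set<Long> ids, int id) {
        return ids.contains(id);
    }

    // Converting to the element type first makes the lookup meaningful.
    static boolean lookupWithElementType(Set<Long> ids, int id) {
        return ids.contains((long) id);
    }

    public static void main(String[] args) {
        Set<Long> ids = new HashSet<>();
        ids.add(42L);
        System.out.println(lookupWithWrongType(ids, 42));
        System.out.println(lookupWithElementType(ids, 42));
    }
}
```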





[jira] [Updated] (TIKA-4292) Mismatched type in contains() calls in OneNoteTreeWalker

2024-08-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4292:
--
Affects Version/s: 3.0.0-BETA

> Mismatched type in contains() calls in OneNoteTreeWalker
> 
>
> Key: TIKA-4292
> URL: https://issues.apache.org/jira/browse/TIKA-4292
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 3.0.0-BETA
>Reporter: Dmitrii Kriukov
>Priority: Major
>
> lines 472 499 - Set can't contain instances of 
> OneNotePropertyId





[jira] [Updated] (TIKA-4292) Mismatched type in contains() calls in OneNoteTreeWalker

2024-08-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4292:
--
Component/s: parser

> Mismatched type in contains() calls in OneNoteTreeWalker
> 
>
> Key: TIKA-4292
> URL: https://issues.apache.org/jira/browse/TIKA-4292
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 3.0.0-BETA
>Reporter: Dmitrii Kriukov
>Priority: Major
>
> lines 472 499 - Set can't contain instances of 
> OneNotePropertyId





[jira] [Resolved] (TIKA-4293) Mismatched type in contains() calls in StreamingDetectContext

2024-08-05 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4293.
---
  Assignee: Tilman Hausherr
Resolution: Fixed

Thanks, fixed. Ignore the Hudson entry. The CI is very unstable currently.

> Mismatched type in contains() calls in StreamingDetectContext
> -
>
> Key: TIKA-4293
> URL: https://issues.apache.org/jira/browse/TIKA-4293
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Dmitrii Kriukov
>    Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> line 80
> Map may not contain keys of type Class
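
The key type of the Map did not survive in the text above. Assuming for illustration that the keys are class names (String), the following sketch shows the effect the inspection warns about. It is a made-up example, not the StreamingDetectContext code.

```java
import java.util.HashMap;
import java.util.Map;

class ContainsKeyTypeMismatchDemo {
    // Passing the Class object compiles because containsKey takes Object,
    // but a Class never equals a String key.
    static boolean lookupByClass(Map<String, Integer> counts, Class<?> type) {
        return counts.containsKey(type);
    }

    // Using the same key type as the map gives the intended result.
    static boolean lookupByName(Map<String, Integer> counts, Class<?> type) {
        return counts.containsKey(type.getName());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put(String.class.getName(), 1);
        System.out.println(lookupByClass(counts, String.class));
        System.out.println(lookupByName(counts, String.class));
    }
}
```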





[jira] [Updated] (TIKA-4293) Mismatched type in contains() calls in StreamingDetectContext

2024-08-04 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4293:
--
Affects Version/s: 2.9.2

> Mismatched type in contains() calls in StreamingDetectContext
> -
>
> Key: TIKA-4293
> URL: https://issues.apache.org/jira/browse/TIKA-4293
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Dmitrii Kriukov
>Priority: Major
>
> line 80
> Map may not contain keys of type Class





[jira] [Updated] (TIKA-4293) Mismatched type in contains() calls in StreamingDetectContext

2024-08-04 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4293:
--
Fix Version/s: 3.0.0
   2.9.3

> Mismatched type in contains() calls in StreamingDetectContext
> -
>
> Key: TIKA-4293
> URL: https://issues.apache.org/jira/browse/TIKA-4293
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Dmitrii Kriukov
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> line 80
> Map may not contain keys of type Class





[jira] [Updated] (TIKA-4290) Fix code inspection anomalies

2024-08-04 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4290:
--
Fix Version/s: 3.0.0
   2.9.3

> Fix code inspection anomalies
> -
>
> Key: TIKA-4290
> URL: https://issues.apache.org/jira/browse/TIKA-4290
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.2
>    Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>






[jira] [Updated] (TIKA-4280) Tasks for the 3.0.0 release

2024-08-03 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4280:
--
Description: 
I'm too lazy to open separate tickets. Please do so if desired.

Some items:
* Before releasing the real 3.0.0 we need to remove any "-M" dependencies
* Decide about the ffmpeg issue and the hdf5 issue
* Run the regression tests vs 2.9.x
* Convert tika-grpc to use the dependency plugin instead of the shade plugin
* Turn javadocs back on. I got errors during the deploy process because javadoc 
needed the auto-generated code ("cannot find symbol DeleteFetcherRequest"). We 
need to enable javadocs for the rest of the project.
* TIKA-4290 Tilman question

Other things? Thank you [~tilman] for the first two!

  was:
I'm too lazy to open separate tickets. Please do so if desired.

Some items:
* Before releasing the real 3.0.0 we need to remove any "-M" dependencies
* Decide about the ffmpeg issue and the hdf5 issue
* Run the regression tests vs 2.9.x
* Convert tika-grpc to use the dependency plugin instead of the shade plugin
* Turn javadocs back on. I got errors during the deploy process because javadoc 
needed the auto-generated code ("cannot find symbol DeleteFetcherRequest"). We 
need to enable javadocs for the rest of the project.

Other things? Thank you [~tilman] for the first two!


> Tasks for the 3.0.0 release
> ---
>
> Key: TIKA-4280
> URL: https://issues.apache.org/jira/browse/TIKA-4280
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I'm too lazy to open separate tickets. Please do so if desired.
> Some items:
> * Before releasing the real 3.0.0 we need to remove any "-M" dependencies
> * Decide about the ffmpeg issue and the hdf5 issue
> * Run the regression tests vs 2.9.x
> * Convert tika-grpc to use the dependency plugin instead of the shade plugin
> * Turn javadocs back on. I got errors during the deploy process because 
> javadoc needed the auto-generated code ("cannot find symbol 
> DeleteFetcherRequest"). We need to enable javadocs for the rest of the 
> project.
> * TIKA-4290 Tilman question
> Other things? Thank you [~tilman] for the first two!





[jira] [Reopened] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-08-03 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reopened TIKA-4252:
---

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> When calling:
> PipesResult pipesResult = pipesClient.process(
>         new FetchEmitTuple(request.getFetchKey(),
>                 new FetchKey(fetcher.getName(), request.getFetchKey()),
>                 new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>                 FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
> UnsynchronizedByteArrayOutputStream bos =
>         UnsynchronizedByteArrayOutputStream.builder().get();
> try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>     objectOutputStream.writeObject(t);
> }
> byte[] bytes = bos.toByteArray();
> output.write(CALL.getByte());
> output.writeInt(bytes.length);
> output.write(bytes);
> output.flush();
>  
> I verified that the bytes contain the expected metadata at that point.
>  
> UPDATE: found the issue.
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if the fetch 
> tuple metadata is null.
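
The fix described in the UPDATE amounts to a null fallback: keep the metadata supplied with the tuple and create an empty one only when none was supplied. A minimal sketch of that pattern follows. It uses a plain Map as a stand-in for Tika's Metadata class, and the names MetadataFallbackDemo and metadataFor are hypothetical; it is not the code in PipesServer#parseFromTuple.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; a Map stands in for Tika's Metadata class.
final class MetadataFallbackDemo {
    // Returns the caller's metadata unchanged, or a fresh empty container
    // only when the tuple carried no metadata at all.
    static Map<String, String> metadataFor(Map<String, String> tupleMetadata) {
        return tupleMetadata != null ? tupleMetadata : new HashMap<>();
    }

    public static void main(String[] args) {
        Map<String, String> supplied = new HashMap<>();
        supplied.put("source", "fetch-input");
        System.out.println(metadataFor(supplied)); // keeps the supplied entry
        System.out.println(metadataFor(null));     // empty fallback
    }
}
```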





[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-08-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870807#comment-17870807
 ] 

Tilman Hausherr commented on TIKA-4252:
---

Please have a look at PR #1872. Even with the proposed correction of
{code}
Class superClazz = clazz.equals(superClassName) ? clazz : Class.forName(superClassName);
{code}
to
{code}
Class superClazz = clazz.toString().equals(superClassName) ? clazz : Class.forName(superClassName);
{code}
superClazz would always be assigned the same value regardless of which branch 
of the ternary is taken.
Also, {{clazzName}} from a few lines above is unused. I wonder if something 
completely different was intended.
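
The point of the comment can be checked in isolation. The sketch below is not the code from the PR; the class ClassNameCompareDemo and its method names are hypothetical. It compares a Class with a class-name String in three ways and then shows that, when the name does match, Class.forName returns the very same Class object (assuming the same class loader resolves it), so both ternary branches yield the same value.

```java
// Hypothetical illustration of the comparisons discussed in the comment.
final class ClassNameCompareDemo {
    // Original form: compares a Class object with a String, so it is always false.
    static boolean viaEquals(Class<?> clazz, String name) {
        return clazz.equals(name);
    }

    // Proposed form: for an ordinary class, toString() yields e.g.
    // "class java.lang.String", which does not equal the bare class name.
    static boolean viaToString(Class<?> clazz, String name) {
        return clazz.toString().equals(name);
    }

    // Comparing by name requires getName().
    static boolean viaGetName(Class<?> clazz, String name) {
        return clazz.getName().equals(name);
    }

    // When the name matches, forName returns the same Class object,
    // so either branch of the ternary would produce the same result.
    static boolean sameAsForName(Class<?> clazz) {
        try {
            return Class.forName(clazz.getName()) == clazz;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "java.lang.String";
        System.out.println(viaEquals(String.class, name));   // false
        System.out.println(viaToString(String.class, name)); // false
        System.out.println(viaGetName(String.class, name));  // true
        System.out.println(sameAsForName(String.class));     // true
    }
}
```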

> PipesClient#process - seems to lose the Fetch input metadata?
> -
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
>  Issue Type: Bug
>Reporter: Nicholas DiPiazza
>Priority: Major
> Fix For: 3.0.0
>
>
> When calling:
> PipesResult pipesResult = pipesClient.process(
>         new FetchEmitTuple(request.getFetchKey(),
>                 new FetchKey(fetcher.getName(), request.getFetchKey()),
>                 new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>                 FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
> the tikaMetadata is not present in the fetch data when the fetch method is 
> called.
>  
> It's OK through this part: 
> UnsynchronizedByteArrayOutputStream bos =
>         UnsynchronizedByteArrayOutputStream.builder().get();
> try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>     objectOutputStream.writeObject(t);
> }
> byte[] bytes = bos.toByteArray();
> output.write(CALL.getByte());
> output.writeInt(bytes.length);
> output.write(bytes);
> output.flush();
>  
> I verified that the bytes contain the expected metadata at that point.
>  
> UPDATE: found the issue.
>  
> org.apache.tika.pipes.PipesServer#parseFromTuple
>  
> is using a new Metadata when it should only use empty metadata if the fetch 
> tuple metadata is null.





[jira] [Created] (TIKA-4290) Fix code inspection anomalies

2024-08-03 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4290:
-

 Summary: Fix code inspection anomalies
 Key: TIKA-4290
 URL: https://issues.apache.org/jira/browse/TIKA-4290
 Project: Tika
  Issue Type: Bug
Affects Versions: 2.9.2
Reporter: Tilman Hausherr








[jira] [Updated] (TIKA-4283) Add detection for JKS Keystore

2024-07-24 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4283:
--
Component/s: core
 parser

> Add detection for JKS Keystore
> --
>
> Key: TIKA-4283
> URL: https://issues.apache.org/jira/browse/TIKA-4283
> Project: Tika
>  Issue Type: New Feature
>  Components: core, parser
>Affects Versions: 2.9.2
>Reporter: Lonzak
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> I added detection for Java keystores (JKS). It is based on the magic bytes.
>  
> Some additional information:
> [https://en.wikipedia.org/wiki/Java_KeyStore]
> The magic bytes are described here: 
> [https://en.wikipedia.org/wiki/List_of_file_signatures]
>  
> JKS is a proprietary keystore implementation provided by Sun:
> [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation]
>  
> If possible, this should be added to the 2.9.x branch.
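
For readers unfamiliar with magic-byte detection: a JKS file begins with the four bytes FE ED FE ED, as listed on the file-signature page linked above. The sketch below checks a header for that value. It is only an illustration of the idea and not the detection that was added to Tika (which is presumably expressed in Tika's MIME type definitions rather than in Java code); the class JksMagicDemo and the method looksLikeJks are hypothetical names.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of a magic-byte check; not Tika's detector.
final class JksMagicDemo {
    // First four bytes of a JKS keystore: FE ED FE ED.
    private static final int JKS_MAGIC = 0xFEEDFEED;

    static boolean looksLikeJks(byte[] header) {
        if (header == null || header.length < 4) {
            return false;
        }
        // ByteBuffer reads big-endian by default, matching the on-disk byte order.
        return ByteBuffer.wrap(header, 0, 4).getInt() == JKS_MAGIC;
    }

    public static void main(String[] args) {
        byte[] jksHeader = {(byte) 0xFE, (byte) 0xED, (byte) 0xFE, (byte) 0xED, 0, 0, 0, 2};
        byte[] zipHeader = {'P', 'K', 3, 4};
        System.out.println(looksLikeJks(jksHeader)); // true
        System.out.println(looksLikeJks(zipHeader)); // false
    }
}
```

A four-byte signature is a necessary condition only; a real detector may apply further checks before committing to a type.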





[jira] [Updated] (TIKA-4283) Add detection for JKS Keystore

2024-07-24 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4283:
--
Affects Version/s: 2.9.2

> Add detection for JKS Keystore
> --
>
> Key: TIKA-4283
> URL: https://issues.apache.org/jira/browse/TIKA-4283
> Project: Tika
>  Issue Type: New Feature
>Affects Versions: 2.9.2
>Reporter: Lonzak
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> I added detection for Java keystores (JKS). It is based on the magic bytes.
>  
> Some additional information:
> [https://en.wikipedia.org/wiki/Java_KeyStore]
> The magic bytes are described here: 
> [https://en.wikipedia.org/wiki/List_of_file_signatures]
>  
> JKS is a proprietary keystore implementation provided by Sun:
> [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation]
>  
> If possible, this should be added to the 2.9.x branch.





[jira] [Resolved] (TIKA-4283) Add detection for JKS Keystore

2024-07-24 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4283.
---
  Assignee: Tilman Hausherr
Resolution: Fixed

Done, it's now in 2.* as well, thanks.

> Add detection for JKS Keystore
> --
>
> Key: TIKA-4283
> URL: https://issues.apache.org/jira/browse/TIKA-4283
> Project: Tika
>  Issue Type: New Feature
>  Components: core, parser
>Affects Versions: 2.9.2
>Reporter: Lonzak
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> I added detection for Java keystores (JKS). It is based on the magic bytes.
>  
> Some additional information:
> [https://en.wikipedia.org/wiki/Java_KeyStore]
> The magic bytes are described here: 
> [https://en.wikipedia.org/wiki/List_of_file_signatures]
>  
> JKS is a proprietary keystore implementation provided by Sun:
> [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation]
>  
> If possible, this should be added to the 2.9.x branch.





[jira] [Updated] (TIKA-4283) Add detection for JKS Keystore

2024-07-24 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4283:
--
Fix Version/s: 3.0.0

> Add detection for JKS Keystore
> --
>
> Key: TIKA-4283
> URL: https://issues.apache.org/jira/browse/TIKA-4283
> Project: Tika
>  Issue Type: New Feature
>Reporter: Lonzak
>Priority: Major
> Fix For: 3.0.0, 2.9.3
>
>
> I added detection for Java keystores (JKS). It is based on the magic bytes.
>  
> Some additional information:
> [https://en.wikipedia.org/wiki/Java_KeyStore]
> The magic bytes are described here: 
> [https://en.wikipedia.org/wiki/List_of_file_signatures]
>  
> JKS is a proprietary keystore implementation provided by Sun:
> [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation]
>  
> If possible, this should be added to the 2.9.x branch.





[jira] [Commented] (TIKA-4285) Invalid Link for changelog CHANGES.txt files

2024-07-22 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867684#comment-17867684
 ] 

Tilman Hausherr commented on TIKA-4285:
---

Additionally, the 3.0.0-BETA2 link works; however, the text mentions "Tika 
2.9.2".

> Invalid Link for changelog CHANGES.txt files
> 
>
> Key: TIKA-4285
> URL: https://issues.apache.org/jira/browse/TIKA-4285
> Project: Tika
>  Issue Type: Task
>Affects Versions: 2.9.0, 2.9.1, 2.9.2
>Reporter: Lonzak
>Priority: Major
>
> On the Tika [start page|https://tika.apache.org/], the links to the 
> CHANGES.txt changelog files are broken starting with version 2.9.0.
>  
> {+}Working{+}:
> https://archive.apache.org/dist/tika/2.8.0/CHANGES-2.8.0.txt
> +Not working:+
> https://archive.apache.org/dist/{-}{color:#FF}release{color}{-}/tika/2.9.0/CHANGES-2.9.0.txt
> https://archive.apache.org/dist/{-}{color:#FF}release{color}{-}/tika/2.9.1/CHANGES-2.9.1.txt
> https://archive.apache.org/dist/{-}{color:#FF}release{color}{-}/tika/2.9.2/CHANGES-2.9.2.txt
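
The struck-out "release" segment suggests that the 2.9.x links should follow the same layout as the working 2.8.0 link. Under that assumption (the report implies it but does not confirm that the 2.9.x files exist at those paths), the expected URL can be derived from the version alone. The class ChangesUrlDemo and the method changesUrl are hypothetical names used for illustration.

```java
// Hypothetical helper; assumes every release follows the layout of the working 2.8.0 link.
final class ChangesUrlDemo {
    private static final String ARCHIVE_BASE = "https://archive.apache.org/dist/tika/";

    // Builds <base>/<version>/CHANGES-<version>.txt, with no "release" path segment.
    static String changesUrl(String version) {
        return ARCHIVE_BASE + version + "/CHANGES-" + version + ".txt";
    }

    public static void main(String[] args) {
        for (String version : new String[] {"2.8.0", "2.9.0", "2.9.1", "2.9.2"}) {
            System.out.println(changesUrl(version));
        }
    }
}
```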





[jira] [Closed] (TIKA-4284) [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13

2024-07-19 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-4284.
-
Resolution: Invalid

> [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and 
> strudl.0.3.13
> ---
>
> Key: TIKA-4284
> URL: https://issues.apache.org/jira/browse/TIKA-4284
> Project: Tika
>  Issue Type: Bug
>Reporter: Abhijit Rajwade
>Priority: Major
>  Labels: SECURITY
>
> CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13
> Description :
> Severity : CVE CVSS 3: 7.5, Sonatype CVSS 3: 7.5
> Weakness : Sonatype CWE: 400
> Source : National Vulnerability Database
> Categories : Data
> Description from CVE : An issue was discovered in the stripTags and 
> unescapeHTML components in Prototype 1.7.3 where an attacker can cause a 
> Regular Expression Denial of Service through stripping crafted HTML tags.
> Explanation : The prototype package is vulnerable to Regular Expression 
> Denial of Service (ReDoS) attacks. The stripTags() function in the String.js 
> file used to unescape HTML fails to efficiently parse and remove tags within 
> a given string. An attacker can exploit this vulnerability by submitting a 
> crafted code block which, when parsed by the affected function, will exhaust 
> system resources and trigger a DoS condition.
> Detection : The application is vulnerable by using this component.
> Recommendation : There is no non-vulnerable upgrade path for this 
> component/package. We recommend investigating alternative components or a 
> potential mitigating control.
> Root Cause : activemq-osgi-5.17.6.jar org/apache/activemq/web/prototype.js : 
> [ , ]
> Advisories : Attack: https://github.com/AlyxRen/prototype.node.js
> CVSS Details : CVE CVSS 3: 7.5, CVSS Vector: 
> CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
> CVE : CVE-2020-27511
> URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511
> Remediation : This component does not have any non-vulnerable version. Please 
> contact the vendor to get this vulnerability fixed.
> ===
> Description :
> Severity : CVE CVSS 3: 7.5, Sonatype CVSS 3: 7.5
> Weakness : Sonatype CWE: 400
> Source : National Vulnerability Database
> Categories : Data
> Description from CVE : An issue was discovered in the stripTags and 
> unescapeHTML components in Prototype 1.7.3 where an attacker can cause a 
> Regular Expression Denial of Service through stripping crafted HTML tags.
> Explanation : The prototype package is vulnerable to Regular Expression 
> Denial of Service (ReDoS) attacks. The stripTags() function in the String.js 
> file used to unescape HTML fails to efficiently parse and remove tags within 
> a given string. An attacker can exploit this vulnerability by submitting a 
> crafted code block which, when parsed by the affected function, will exhaust 
> system resources and trigger a DoS condition.
> Detection : The application is vulnerable by using this component.
> Recommendation : There is no non-vulnerable upgrade path for this 
> component/package. We recommend investigating alternative components or a 
> potential mitigating control.
> Root Cause : strudl.0.3.13 : [ , ]
> Advisories : Attack: https://github.com/AlyxRen/prototype.node.js
> CVSS Details : CVE CVSS 3: 7.5, CVSS Vector: 
> CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
> CVE : CVE-2020-27511
> URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511
> Remediation : This component does not have any non-vulnerable version. Please 
> contact the vendor to get this vulnerability fixed.





[jira] [Commented] (TIKA-4284) [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13

2024-07-19 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867236#comment-17867236
 ] 

Tilman Hausherr commented on TIKA-4284:
---

How is this related to Tika? What subproject uses activemq-osgi-5.17.6 and 
strudl.0.3.13?

> [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and 
> strudl.0.3.13
> ---
>
> Key: TIKA-4284
> URL: https://issues.apache.org/jira/browse/TIKA-4284
> Project: Tika
>  Issue Type: Bug
>Reporter: Abhijit Rajwade
>Priority: Major
>  Labels: SECURITY
>
> CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13
> Description :
> Severity : CVE CVSS 3: 7.5, Sonatype CVSS 3: 7.5
> Weakness : Sonatype CWE: 400
> Source : National Vulnerability Database
> Categories : Data
> Description from CVE : An issue was discovered in the stripTags and 
> unescapeHTML components in Prototype 1.7.3 where an attacker can cause a 
> Regular Expression Denial of Service through stripping crafted HTML tags.
> Explanation : The prototype package is vulnerable to Regular Expression 
> Denial of Service (ReDoS) attacks. The stripTags() function in the String.js 
> file used to unescape HTML fails to efficiently parse and remove tags within 
> a given string. An attacker can exploit this vulnerability by submitting a 
> crafted code block which, when parsed by the affected function, will exhaust 
> system resources and trigger a DoS condition.
> Detection : The application is vulnerable by using this component.
> Recommendation : There is no non-vulnerable upgrade path for this 
> component/package. We recommend investigating alternative components or a 
> potential mitigating control.
> Root Cause : activemq-osgi-5.17.6.jar org/apache/activemq/web/prototype.js : 
> [ , ]
> Advisories : Attack: https://github.com/AlyxRen/prototype.node.js
> CVSS Details : CVE CVSS 3: 7.5, CVSS Vector: 
> CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
> CVE : CVE-2020-27511
> URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511
> Remediation : This component does not have any non-vulnerable version. Please 
> contact the vendor to get this vulnerability fixed.
> ===
> Description :
> Severity : CVE CVSS 3: 7.5, Sonatype CVSS 3: 7.5
> Weakness : Sonatype CWE: 400
> Source : National Vulnerability Database
> Categories : Data
> Description from CVE : An issue was discovered in the stripTags and 
> unescapeHTML components in Prototype 1.7.3 where an attacker can cause a 
> Regular Expression Denial of Service through stripping crafted HTML tags.
> Explanation : The prototype package is vulnerable to Regular Expression 
> Denial of Service (ReDoS) attacks. The stripTags() function in the String.js 
> file used to unescape HTML fails to efficiently parse and remove tags within 
> a given string. An attacker can exploit this vulnerability by submitting a 
> crafted code block which, when parsed by the affected function, will exhaust 
> system resources and trigger a DoS condition.
> Detection : The application is vulnerable by using this component.
> Recommendation : There is no non-vulnerable upgrade path for this 
> component/package. We recommend investigating alternative components or a 
> potential mitigating control.
> Root Cause : strudl.0.3.13 : [ , ]
> Advisories : Attack: https://github.com/AlyxRen/prototype.node.js
> CVSS Details : CVE CVSS 3: 7.5, CVSS Vector: 
> CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
> CVE : CVE-2020-27511
> URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511
> Remediation : This component does not have any non-vulnerable version. Please 
> contact the vendor to get this vulnerability fixed.





[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230

2024-07-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4282:
--
Description: 
The latest h2 version (which needs jdk11) brings a syntax error because of an 
unneeded comma in one SQL query.

release notes:
https://github.com/h2database/h2database/releases/tag/version-2.3.230

likely this:
https://github.com/h2database/h2database/issues/3106

  was:
The latest h2 (which needs jdk11) version brings a syntax error because of an 
unneeded comma in one SQL query.

release notes:
https://github.com/h2database/h2database/releases/tag/version-2.3.230

likely this:
https://github.com/h2database/h2database/issues/3106


> Syntax error with h2 version 2.3.230
> 
>
> Key: TIKA-4282
> URL: https://issues.apache.org/jira/browse/TIKA-4282
> Project: Tika
>  Issue Type: Bug
>  Components: tika-eval
>Affects Versions: 3.0.0-BETA, 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>
> The latest h2 version (which needs jdk11) brings a syntax error because of an 
> unneeded comma in one SQL query.
> release notes:
> https://github.com/h2database/h2database/releases/tag/version-2.3.230
> likely this:
> https://github.com/h2database/h2database/issues/3106





[jira] [Resolved] (TIKA-4282) Syntax error with h2 version 2.3.230

2024-07-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4282.
---
Resolution: Fixed

> Syntax error with h2 version 2.3.230
> 
>
> Key: TIKA-4282
> URL: https://issues.apache.org/jira/browse/TIKA-4282
> Project: Tika
>  Issue Type: Bug
>  Components: tika-eval
>Affects Versions: 3.0.0-BETA, 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>
> The latest h2 (which needs jdk11) version brings a syntax error because of an 
> unneeded comma in one SQL query.
> release notes:
> https://github.com/h2database/h2database/releases/tag/version-2.3.230
> likely this:
> https://github.com/h2database/h2database/issues/3106





[jira] [Created] (TIKA-4282) Syntax error with h2 version 2.3.230

2024-07-17 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4282:
-

 Summary: Syntax error with h2 version 2.3.230
 Key: TIKA-4282
 URL: https://issues.apache.org/jira/browse/TIKA-4282
 Project: Tika
  Issue Type: Bug
  Components: tika-eval
Affects Versions: 3.0.0-BETA
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
 Fix For: 3.0.0


The latest h2 version brings a syntax error because of an unneeded comma in one 
SQL query.

release notes:
https://github.com/h2database/h2database/releases/tag/version-2.3.230

likely this:
https://github.com/h2database/h2database/issues/3106





[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230

2024-07-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4282:
--
Description: 
The latest h2 (which needs jdk11) version brings a syntax error because of an 
unneeded comma in one SQL query.

release notes:
https://github.com/h2database/h2database/releases/tag/version-2.3.230

likely this:
https://github.com/h2database/h2database/issues/3106

  was:
The latest h2 version brings a syntax error because of an unneeded comma in one 
SQL query.

release notes:
https://github.com/h2database/h2database/releases/tag/version-2.3.230

likely this:
https://github.com/h2database/h2database/issues/3106


> Syntax error with h2 version 2.3.230
> 
>
> Key: TIKA-4282
> URL: https://issues.apache.org/jira/browse/TIKA-4282
> Project: Tika
>  Issue Type: Bug
>  Components: tika-eval
>Affects Versions: 3.0.0-BETA
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0
>
>
> The latest h2 (which needs jdk11) version brings a syntax error because of an 
> unneeded comma in one SQL query.
> release notes:
> https://github.com/h2database/h2database/releases/tag/version-2.3.230
> likely this:
> https://github.com/h2database/h2database/issues/3106





[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230

2024-07-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4282:
--
Affects Version/s: 2.9.2

> Syntax error with h2 version 2.3.230
> 
>
> Key: TIKA-4282
> URL: https://issues.apache.org/jira/browse/TIKA-4282
> Project: Tika
>  Issue Type: Bug
>  Components: tika-eval
>Affects Versions: 3.0.0-BETA, 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0
>
>
> The latest h2 (which needs jdk11) version brings a syntax error because of an 
> unneeded comma in one SQL query.
> release notes:
> https://github.com/h2database/h2database/releases/tag/version-2.3.230
> likely this:
> https://github.com/h2database/h2database/issues/3106





[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230

2024-07-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4282:
--
Fix Version/s: 2.9.3

> Syntax error with h2 version 2.3.230
> 
>
> Key: TIKA-4282
> URL: https://issues.apache.org/jira/browse/TIKA-4282
> Project: Tika
>  Issue Type: Bug
>  Components: tika-eval
>Affects Versions: 3.0.0-BETA, 2.9.2
>    Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 3.0.0, 2.9.3
>
>
> The latest h2 (which needs jdk11) version brings a syntax error because of an 
> unneeded comma in one SQL query.
> release notes:
> https://github.com/h2database/h2database/releases/tag/version-2.3.230
> likely this:
> https://github.com/h2database/h2database/issues/3106





[jira] [Closed] (TIKA-1155) Number Format is converted with an error

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-1155.
-
Resolution: Cannot Reproduce

Closing because it can no longer be reproduced; it has probably been fixed 
either by us or in POI. Please comment and/or reopen if you disagree.

> Number Format is converted with an error
> 
>
> Key: TIKA-1155
> URL: https://issues.apache.org/jira/browse/TIKA-1155
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Evgeniy Buyanov
>Priority: Major
> Attachments: screenshot-1.png, test-Excel.csv, test.xlsx, test.xml
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> {code:Title=Source data}
> 
><NumberFormat ss:Format="_-* #,##0\ _B_F_-;\-* #,##0\ _B_F_-;_-* 
> &quot;-&quot;\ _B_F_-;_-@_-"/>
> 
> 10
> -10
> {code}
> java -jar tika-app-1.4.jar test.xlsx > test.xml
> {code:Title=Result}
>   * 10 _F
>   -10 _F
> {code}
> related ASF Bugzilla – Bug 
> [52592|https://issues.apache.org/bugzilla/show_bug.cgi?id=52592]





[jira] [Commented] (TIKA-1155) Number Format is converted with an error

2024-07-16 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866408#comment-17866408
 ] 

Tilman Hausherr commented on TIKA-1155:
---

Current output:
{code:xml}
Sheet1
  10
-   10
-
text



{code}
Looks like this on the screen:
 !screenshot-1.png! 

> Number Format is converted with an error
> 
>
> Key: TIKA-1155
> URL: https://issues.apache.org/jira/browse/TIKA-1155
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Evgeniy Buyanov
>Priority: Major
> Attachments: screenshot-1.png, test-Excel.csv, test.xlsx, test.xml
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> {code:Title=Source data}
> 
><NumberFormat ss:Format="_-* #,##0\ _B_F_-;\-* #,##0\ _B_F_-;_-* 
> &quot;-&quot;\ _B_F_-;_-@_-"/>
> 
> 10
> -10
> {code}
> java -jar tika-app-1.4.jar test.xlsx > test.xml
> {code:Title=Result}
>   * 10 _F
>   -10 _F
> {code}
> related ASF Bugzilla – Bug 
> [52592|https://issues.apache.org/bugzilla/show_bug.cgi?id=52592]





[jira] [Updated] (TIKA-1155) Number Format is converted with an error

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1155:
--
Attachment: screenshot-1.png

> Number Format is converted with an error
> 
>
> Key: TIKA-1155
> URL: https://issues.apache.org/jira/browse/TIKA-1155
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Evgeniy Buyanov
>Priority: Major
> Attachments: screenshot-1.png, test-Excel.csv, test.xlsx, test.xml
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> {code:Title=Source data}
> 
><NumberFormat ss:Format="_-* #,##0\ _B_F_-;\-* #,##0\ _B_F_-;_-* 
> &quot;-&quot;\ _B_F_-;_-@_-"/>
> 
> 10
> -10
> {code}
> java -jar tika-app-1.4.jar test.xlsx > test.xml
> {code:Title=Result}
>   * 10 _F
>   -10 _F
> {code}
> related ASF Bugzilla – Bug 
> [52592|https://issues.apache.org/bugzilla/show_bug.cgi?id=52592]





[jira] [Closed] (TIKA-3028) Failed test at SAS7BDATParserTest:112

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-3028.
-
Resolution: Cannot Reproduce

Closing for now because there has been no activity for years; please reopen if 
it still happens. I remember I had several problems with a German locale in my 
early months as a committer, and we made some fixes in the code and some 
configuration changes in my IDE.

> Failed test at SAS7BDATParserTest:112
> -
>
> Key: TIKA-3028
> URL: https://issues.apache.org/jira/browse/TIKA-3028
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.23
>Reporter: Wknds
>Priority: Blocker
> Attachments: Bildschirmfoto 2020-01-24 um 23.12.20.png
>
>
> Test fails at 
> SAS7BDATParserTest.testMultiColumns:112->TikaTest.assertContains:107.
> The expected date is _01Jan1960:00:00_,
> while on my system the month abbreviations in the dates from the (untouched) 
> test file carry a trailing '.' (please refer to the terminal output below).
> {code:java}
> // code placeholder
> [ERROR] Failures: 
> [ERROR]   
> SAS7BDATParserTest.testMultiColumns:112->TikaTest.assertContains:107 
> 01Jan1960:00:00 not found in:
> TESTING   Record Number   Square of the Record Number Description of 
> the Row  Percent DonePercent Increment   datedatetimetime 
>0   0   This is row0 of   100%  
> 01-01-1960  01Jan.1960:00:00:01.00  00:00:011   1   This 
> is row1 of   1010% 0.0%02-01-1960  
> 01Jan.1960:00:00:10.00  00:00:032   4   This is row   
>  2 of   1020% 50.0%   17-01-1960  
> 01Jan.1960:00:01:40.00  00:00:093   9   This is row   
>  3 of   1030% 66.7%   22-03-1960  
> 01Jan.1960:00:16:40.00  00:00:274   16  This is row   
>  4 of   1040% 75.0%   13-09-1960  
> 01Jan.1960:02:46:40.00  00:01:215   25  This is row   
>  5 of   1050% 80.0%   17-09-1961  
> 02Jan.1960:03:46:40.00  00:04:036   36  This is row   
>  6 of   1060% 83.3%   20-07-1963  
> 12Jan.1960:13:46:40.00  00:12:097   49  This is row   
>  7 of   1070% 85.7%   29-07-1966  
> 25Apr.1960:17:46:40.00  00:36:278   64  This is row   
>  8 of   1080% 87.5%   20-03-1971  
> 03März1963:09:46:40.00  01:49:219   81  This is row   
>  9 of   1090% 88.9%   18-12-1977  
> 09Sep.1991:01:46:40.00  05:28:0310  100 This is row   
> 10 of   10100%90.0%   19-05-1987  
> 19Nov.2276:17:46:40.00  16:24:09
> {code}
>  
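
The output above (`01Jan.1960`, `03März1963`) is consistent with month abbreviations being formatted in the reporter's default locale. The sketch below illustrates the general effect with `java.time`; it is not the parser's actual code. The class `LocaleDateDemo` and the pattern `ddMMMyyyy:HH:mm:ss` are assumptions chosen to resemble the expected value in the report, and the exact German abbreviation depends on the JDK's locale data.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class LocaleDateDemo {
    // Hypothetical pattern, chosen to resemble "01Jan1960:00:00" from the report.
    static final String PATTERN = "ddMMMyyyy:HH:mm:ss";

    /** Formats the value with an explicit locale instead of the JVM default. */
    static String format(LocalDateTime value, Locale locale) {
        return DateTimeFormatter.ofPattern(PATTERN, locale).format(value);
    }

    public static void main(String[] args) {
        LocalDateTime epoch = LocalDateTime.of(1960, 1, 1, 0, 0, 0);
        // Fixed locale: stable output regardless of the machine's settings.
        System.out.println(format(epoch, Locale.US));
        // German locale: the month abbreviation may differ (for example with a trailing dot).
        System.out.println(format(epoch, Locale.GERMANY));
    }
}
```

Passing a fixed locale to the formatter is one common way to make such a test independent of the machine it runs on.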





[jira] [Updated] (TIKA-3290) Extension reading it as eml instead of txt

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-3290:
--
Fix Version/s: (was: 1.24.1)

> Extension reading it as eml instead of txt
> --
>
> Key: TIKA-3290
> URL: https://issues.apache.org/jira/browse/TIKA-3290
> Project: Tika
> Issue Type: Bug
> Components: core, mime
> Affects Versions: 1.25
> Reporter: Tika User
> Priority: Major
> Labels: tika-parsers
> Attachments: image-2021-02-22-10-13-08-447.png, 
> image-2021-02-23-12-39-00-778.png, test_sample_message.txt
>
>
> The attached file is detected as eml instead of txt. With version 1.24.1 it 
> was detected as txt, but since the upgrade to 1.25 it is detected as eml. As 
> a result, parsing fails with a mail corrupted error.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-3172) PDF Parser configuration enable auto space using tika config file

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-3172.
---
Fix Version/s: 1.25
 Assignee: Tim Allison
   Resolution: Fixed

> PDF Parser configuration enable auto space using tika config file
> -
>
> Key: TIKA-3172
> URL: https://issues.apache.org/jira/browse/TIKA-3172
> Project: Tika
> Issue Type: Wish
> Components: parser
> Affects Versions: 1.24.1
> Reporter: Akash
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.25
>
>
> Need information on how to set enableAutoSpace using tika config file.
> {code:java}
> /
>   
> 
>   
> 
> 
>   
> false
>   
> 
>   
> / 
> {code}
> Above configuration is not working.





[jira] [Closed] (TIKA-3155) Parse Error while extracting CSV files

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-3155.
-
Resolution: Duplicate

Closing as a duplicate of TIKA-4278. With the improved logic, this file is no 
longer treated as CSV.

> Parse Error while extracting CSV files
> --
>
> Key: TIKA-3155
> URL: https://issues.apache.org/jira/browse/TIKA-3155
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.24.1
> Reporter: Akash
> Priority: Major
> Attachments: UTF-8_chars.csv
>
>
> We are getting a parse error while trying to extract CSV files.
> This worked in version 1.9, but an exception is thrown in 1.24.1.
>  
> {code:java}
> /Exception in thread "main" org.apache.tika.exception.TikaException: 
> exception parsing the csv
>   at 
> org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:198 
> undefined)
>   at 
> org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 
> undefined)
>   at 
> org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 
> undefined)
>   at 
> org.apache.tikar.AutoDetectParser.parse.parse(AutoDetectParser.java:143 
> undefined)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209 
> undefined)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496 undefined)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149 undefined)
> Caused by: java.lang.IllegalStateException: IOException reading next record: 
> java.io.IOException: (startline 39) EOF reached before encapsulated token 
> finished
>   at 
> org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145
>  undefined)
>   at 
> org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155 
> undefined)
>   at 
> org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:178 
> undefined)
>   ... 6 more
> Caused by: java.io.IOException: (startline 39) EOF reached before 
> encapsulated token finished
>   at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288 
> undefined)
>   at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158 undefined)
>   at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674 
> undefined)
>   at 
> org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142
>  undefined)/ 
> {code}
> The issue occurs when one of the cells contains double quotes.
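
The message "EOF reached before encapsulated token finished" in the trace above points to a double quote that opens a quoted field and is never closed before the end of the input. The sketch below shows a simple pre-check for that condition. It is not the Commons CSV lexer: the class `QuoteBalance` is hypothetical, it assumes RFC 4180 style escaping (a doubled quote inside a quoted field), and it treats every unescaped quote as a toggle, which is stricter than a real lexer.

```java
final class QuoteBalance {
    /** Returns true if the text ends while still inside a quoted field. */
    static boolean endsInsideQuotes(String text) {
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != '"') {
                continue;
            }
            if (inQuotes && i + 1 < text.length() && text.charAt(i + 1) == '"') {
                i++; // doubled quote inside a quoted field is an escaped quote
            } else {
                inQuotes = !inQuotes; // opening or closing quote
            }
        }
        return inQuotes;
    }

    public static void main(String[] args) {
        System.out.println(endsInsideQuotes("a,\"b \"\"c\"\" d\",e\n"));
        System.out.println(endsInsideQuotes("a,\"b,c\n"));
    }
}
```

A check like this can help decide whether a file with stray quotes should be handed to a strict CSV parser at all or handled as plain text.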





[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-16 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866277#comment-17866277
 ] 

Tilman Hausherr commented on TIKA-4278:
---

If colon and another delimiter have been detected with the same confidence, use 
the other one.
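
The tie-break rule in this comment can be sketched as follows. This is an illustration under assumptions, not the code committed to Tika: the class `DelimiterTieBreak`, the method `pick`, and the representation of confidences as a map are all hypothetical.

```java
import java.util.Map;

final class DelimiterTieBreak {
    /**
     * Picks the delimiter with the highest confidence. On an exact tie,
     * a colon loses to any other delimiter.
     */
    static Character pick(Map<Character, Double> confidences) {
        Character best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Character, Double> entry : confidences.entrySet()) {
            double score = entry.getValue();
            boolean better = score > bestScore;
            // Replace a colon that is currently leading if another delimiter ties with it.
            boolean tieAgainstColon = score == bestScore && best != null && best == ':';
            if (better || tieAgainstColon) {
                best = entry.getKey();
                bestScore = score;
            }
        }
        return best;
    }
}
```

The result does not depend on the order in which the candidates are visited: a colon only wins when its confidence is strictly higher than every other candidate's.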

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.
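
To illustrate why the candidate list matters, here is a deliberately simplified sniffer that takes its delimiters as a set, as the issue proposes. It is not Tika's `CSVSniffer`: the class `SimpleDelimiterSniffer` and its consistency rule (same non-zero count on every line) are assumptions made for this sketch.

```java
import java.util.Set;

final class SimpleDelimiterSniffer {
    /**
     * Returns the first candidate that occurs the same non-zero number of
     * times on every line, or null if no candidate qualifies.
     */
    static Character sniff(String text, Set<Character> candidates) {
        String[] lines = text.split("\\R");
        for (char candidate : candidates) {
            long expected = count(lines[0], candidate);
            if (expected == 0) {
                continue; // candidate does not occur in the first line
            }
            boolean consistent = true;
            for (String line : lines) {
                if (count(line, candidate) != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent) {
                return candidate;
            }
        }
        return null;
    }

    private static long count(String line, char c) {
        return line.chars().filter(ch -> ch == c).count();
    }
}
```

With only comma and tab as candidates, a file such as `a;b;c` followed by `1;2;3` yields no delimiter; once semicolon, pipe and colon are added to the set, the semicolon is found.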





[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4278:
--
Attachment: reports_csv_2.9.2_vs_2.9.3_4.tar.xz

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-15 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866147#comment-17866147
 ] 

Tilman Hausherr edited comment on TIKA-4278 at 7/15/24 6:40 PM:


I've now added a check: if the delimiter isn't in row zero, further hits in 
later rows don't count. This fixes the problem of too many non-CSV files being 
recognized as CSV.

Only one problem is left now: false colon-separated lines. I haven't had any 
colon-separated files in decades, but a Google search does find some SO 
questions, so I'll leave the colon in for now. We can still change it after 
the "big" regression tests.


was (Author: tilman):
I've now added a check that if the delimiter isn't in row zero then further 
hits later don't count. This fixes the problem that too many files are 
recognized as CSV that are not.

Only one problem left now: colon-separated lines. I never had any in decades, 
but a google search does find some SO questions, so I'll leave that there for 
now. We can still change it after the "big" regression tests.

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-15 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866147#comment-17866147
 ] 

Tilman Hausherr edited comment on TIKA-4278 at 7/15/24 6:24 PM:


I've now added a check: if the delimiter isn't in row zero, further hits in 
later rows don't count. This fixes the problem of too many non-CSV files being 
recognized as CSV.

Only one problem is left now: colon-separated lines. I haven't had any 
colon-separated files in decades, but a Google search does find some SO 
questions, so I'll leave the colon in for now. We can still change it after 
the "big" regression tests.


was (Author: tilman):
I've now added a check that if the delimiter isn't in row zero then further 
hits later don't count. This fixes the problem that too many files are 
recognized as CSV that are not.

Only one problem left now: colon-separated lines. I never had any in decades, 
but a google search does find some SO questions, so I'll leave that there for 
now.

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.





[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file

2024-07-15 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866147#comment-17866147
 ] 

Tilman Hausherr commented on TIKA-4278:
---

I've now added a check: if the delimiter isn't in row zero, further hits in 
later rows don't count. This fixes the problem of too many non-CSV files being 
recognized as CSV.

Only one problem is left now: colon-separated lines. I haven't had any 
colon-separated files in decades, but a Google search does find some SO 
questions, so I'll leave the colon in for now.
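
The row-zero check described in this comment can be sketched as follows. This is a minimal illustration, not the committed change: the class `RowZeroCheck` and the idea of scoring a delimiter by the number of rows that contain it are assumptions made for the example.

```java
final class RowZeroCheck {
    /**
     * Counts the rows that contain the delimiter, but only if row zero
     * contains it as well. Otherwise hits in later rows do not count.
     */
    static int countingRows(String[] rows, char delimiter) {
        if (rows.length == 0 || rows[0].indexOf(delimiter) < 0) {
            return 0; // delimiter missing from row zero: ignore later hits
        }
        int hits = 0;
        for (String row : rows) {
            if (row.indexOf(delimiter) >= 0) {
                hits++;
            }
        }
        return hits;
    }
}
```

A plain-text file whose first line is a title without a colon, followed by lines such as `Note: see below`, then scores zero for the colon and is less likely to be mistaken for CSV.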

> TextAndCSVParser doesn't detect semicolon separated file
> 
>
> Key: TIKA-4278
> URL: https://issues.apache.org/jira/browse/TIKA-4278
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: csv, csvparser
> Fix For: 3.0.0, 2.9.3
>
> Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, 
> reports_csv_2.9.2_vs_2.9.3_3.tar.xz
>
>
> I ran the code from the attached SO issue and yes it doesn't detect semicolon 
> separated files. The reason is this line in {{TextAndCSVParser.java}}:
> {code:java}
> private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'};
> {code}
> This is later used by {{CSVSniffer}}. For some reason the other delimiters 
> (pipe, colon and semicolon) aren't in that array, although they are in 
> {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now 
> it works for semicolon.
> Can I change this by adding the missing delimiters or was there a reason that 
> I missed? Proposed change would change CSVSniffer so that delimiters is a set 
> and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.




