[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557319#comment-17557319
 ] 

Nick Burch commented on TIKA-3798:
--

Do you have a sample file that shows the problem? A thread dump showing the 
place that Tika gets stuck? Suggestions on how we can reproduce your issue?

> Tika hangs up with some RAR archives
> 
>
> Key: TIKA-3798
> URL: https://issues.apache.org/jira/browse/TIKA-3798
> Project: Tika
> Issue Type: Bug
> Environment: Windows, Tika 2.4.0
> Reporter: Mikhail Gushinets
> Priority: Major
>
> Passing a RAR archive to Tika can cause it to hang.
> When trying to unrar this file manually, I get this message: "Checksum is not 
> calculated right of file as there might be a change of the metadata"
> I understand that the probable cause is some kind of file corruption, 
> but it would be nice if Tika just threw an exception in such a case 
> rather than hanging forever.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Mikhail Gushinets (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Gushinets updated TIKA-3798:

Attachment: MicrosoftTeams-image.png



[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Mikhail Gushinets (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557323#comment-17557323
 ] 

Mikhail Gushinets commented on TIKA-3798:
-

Hi Nick! Unfortunately, I cannot provide the original file to you for 
security reasons. 

When I try to unrar this file, it shows an error message that means 
"Checksum is not calculated right of file as there might be a change of the 
metadata"

!MicrosoftTeams-image.png!

The file was probably corrupted in some way when it was opened on Linux, 
then copied to Windows and processed by Tika there.

Unfortunately, none of the logs you're asking about can be provided because 
this happens on a remote client machine.

The symptom is that Tika just doesn't call our callbacks, which would return 
the list of parsed files, for a very long time (~16 hours). 

Other archive files, including RARs, work just fine or result in a 
TikaException if the file cannot be processed.
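
Until the underlying loop is fixed, a caller can at least turn this kind of silent hang into an exception by bounding how long it waits for the parse. The sketch below is only an illustration and not Tika API: `ParseWatchdog` and `callWithTimeout` are hypothetical names, and the `Callable` stands in for whatever code invokes Tika. It only stops the waiting; a thread stuck in a loop that ignores interruption keeps running, so this does not replace running the parse in a separate process.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika.
public class ParseWatchdog {

    /**
     * Runs the task on a daemon worker thread and waits at most timeoutMillis.
     * Throws TimeoutException if the task has not finished by then.
     */
    public static <T> T callWithTimeout(Callable<T> task, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-watchdog-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // This only requests interruption. A tight loop that never checks
            // the interrupt flag keeps running in the background.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse call that never returns.
        Callable<String> stuck = () -> {
            while (true) {
                Thread.sleep(1000);
            }
        };
        try {
            callWithTimeout(stuck, 200);
        } catch (TimeoutException e) {
            System.out.println("parse timed out");
        }
    }
}
```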



[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557341#comment-17557341
 ] 

Tim Allison commented on TIKA-3798:
---

There's not much we can do without an example file.  We could fuzz the RAR 
files we have in our test corpus and see if we can trigger an infinite loop.  
The issue is likely in the dependency (junrar) and not fixable at the Tika level.

We've put together some thoughts on robustness of Tika: 
https://cwiki.apache.org/confluence/display/TIKA/The+Robustness+of+Apache+Tika

Basically, we need to fix the parsers and the underlying dependencies when we 
can.  However, bad things happen when processing files at scale, and you need 
to isolate parsing in a separate process.  We offer several options: 
tika-server, tika-pipes, ForkParser, PipesParser and tika-batch.
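
As a rough illustration of the separate-process idea (not how tika-server or the PipesParser are implemented), the sketch below starts a child process, waits a bounded time, and kills it if it has not finished. It uses only JDK classes; `IsolatedRun` and `runWithTimeout` are hypothetical names, and the command that actually runs Tika in the child JVM is left to the caller because it depends on your setup.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika.
public class IsolatedRun {

    /**
     * Runs the command in a child process. Returns its exit code, or an empty
     * OptionalInt if the child had to be killed after the timeout.
     */
    public static OptionalInt runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.inheritIO(); // child output goes to the parent's console
        Process child = pb.start();
        if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            child.destroyForcibly(); // the stuck parse dies with the process
            child.waitFor();         // wait until the kill has taken effect
            return OptionalInt.empty();
        }
        return OptionalInt.of(child.exitValue());
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: IsolatedRun <command> [args...]");
            return;
        }
        // Assumption: the arguments are the full command that parses one file.
        OptionalInt exit = runWithTimeout(Arrays.asList(args), 60);
        System.out.println(exit.isPresent()
                ? "child exited with code " + exit.getAsInt()
                : "child killed after timeout");
    }
}
```

A runaway parse then costs one killed child process rather than a hung application.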




[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557343#comment-17557343
 ] 

Nick Burch commented on TIKA-3798:
--

With no file, no thread dump and no stack trace, it won't be easy to find the 
relevant code in Tika that isn't behaving properly. As everyone working on Tika 
is a volunteer, you're probably going to have to help us a bit more...

Can you talk your client through taking a Java thread dump and get them to 
share it? Can you get the file, run it yourself through Tika to trigger the 
issue and take a thread dump? Can you share the file privately with one of us?
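
For reference, `jstack <pid>` or `jcmd <pid> Thread.print` from the JDK will print a thread dump of a running JVM. If running those on the client machine isn't practical, the application could log a dump itself; a minimal sketch using the standard `java.lang.management` API follows. `ThreadDumper` is a hypothetical helper name, and when to call it (for example from a watchdog once a parse has run too long) is an assumption about your application.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Tika.
public class ThreadDumper {

    /** Returns the name, state and full stack trace of every live thread. */
    public static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip lock details, which not every JVM supports.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            // Print frames by hand: ThreadInfo.toString() truncates the stack.
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

In a hung process, the frames of the thread that is still RUNNABLE inside the parser are the part worth sharing.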



[jira] [Updated] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3798:
--
Attachment: rar-files.csv.gz



[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557352#comment-17557352
 ] 

Tim Allison commented on TIKA-3798:
---

Turns out we have more RAR files in our test corpus than I thought (see the 
attached rar-files.csv.gz).



[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557407#comment-17557407
 ] 

Tim Allison commented on TIKA-3798:
---

In Tika 2.4.0, we were using junrar 7.5.1.

https://github.com/junrar/junrar/issues/73 shows infinite loops before 7.5.1.
https://github.com/junrar/junrar/issues/81 is still open and has follow-up 
infinite loops from fuzzing.

In short, this is a somewhat known issue that hasn't been solved yet, even in 
7.5.2 (I'm guessing; I'll test later today).




[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557449#comment-17557449
 ] 

Tim Allison commented on TIKA-3798:
---

The three files on junrar's #81 still cause infinite loops in 7.5.2.



[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557469#comment-17557469
 ] 

Tim Allison commented on TIKA-3798:
---

I opened a PR to fix https://github.com/junrar/junrar/issues/81... we'll see 
where that goes.  We won't know whether that is the source of this problem without 
access to the problematic junrar file.

Again, my larger point about isolating Tika in a separate process is key.  We 
can and must fix the one-off infinite loops, but there will be more.



[jira] [Comment Edited] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557469#comment-17557469
 ] 

Tim Allison edited comment on TIKA-3798 at 6/22/22 2:18 PM:


I opened a PR to fix https://github.com/junrar/junrar/issues/81... we'll see 
where that goes.  We won't know whether that is the source of your problem without 
access to the problematic rar file.

Again, my larger point about isolating Tika in a separate process is key.  We 
can and must fix the one-off infinite loops, but there will be more.


was (Author: talli...@mitre.org):
I opened a PR to fix https://github.com/junrar/junrar/issues/81... we'll see 
where that goes.  We won't know that that is the source of this problem without 
access to the problematic junrar file.

Again, my larger point about isolating tika into a separate process is key.  We 
can and must fix the one off infinite loops, but there will be more.



[jira] [Created] (TIKA-3799) Refactor FuzzingCLI to use PipesParser

2022-06-22 Thread Tim Allison (Jira)
Tim Allison created TIKA-3799:
-

             Summary: Refactor FuzzingCLI to use PipesParser
                 Key: TIKA-3799
                 URL: https://issues.apache.org/jira/browse/TIKA-3799
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison








[jira] [Commented] (TIKA-3799) Refactor FuzzingCLI to use PipesParser

2022-06-22 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557495#comment-17557495
 ] 

Tim Allison commented on TIKA-3799:
---

We're currently spawning a separate process for every file.  This is robust and 
simple, but we should be using the PipesParser for this purpose.

> Refactor FuzzingCLI to use PipesParser
> --
>
> Key: TIKA-3799
> URL: https://issues.apache.org/jira/browse/TIKA-3799
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>






[jira] [Resolved] (TIKA-3799) Refactor FuzzingCLI to use PipesParser

2022-06-22 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3799.
---
Fix Version/s: 2.4.2
   Resolution: Fixed



[jira] [Commented] (TIKA-3799) Refactor FuzzingCLI to use PipesParser

2022-06-22 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557726#comment-17557726
 ] 

Hudson commented on TIKA-3799:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #652 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/652/])
TIKA-3799 -- Refactor FuzzingCLI to use PipesParser (tallison: 
[https://github.com/apache/tika/commit/2468e43b1eed19409bfeb8749b02e0d0350d872b])
* (edit) CHANGES.txt
* (edit) tika-fuzzing/src/main/java/org/apache/tika/fuzzing/cli/FuzzingCLI.java
* (edit) tika-fuzzing/pom.xml
* (edit) tika-fuzzing/src/main/java/org/apache/tika/fuzzing/general/GeneralTransformer.java
* (edit) tika-fuzzing/src/test/resources/log4j2.xml
* (add) tika-fuzzing/src/test/resources/configs/tika-fuzzing-config.xml
* (edit) tika-fuzzing/src/main/resources/log4j2.xml
* (edit) tika-fuzzing/src/main/java/org/apache/tika/fuzzing/cli/FuzzingCLIConfig.java

