[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

2020-06-09 Thread suchendra (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130042#comment-17130042 ] suchendra commented on TIKA-3097: Yeah, that's true. If I use SAX DOCX the memory footpri

[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

2020-06-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129624#comment-17129624 ] Tim Allison commented on TIKA-3097: Java will take as much heap as it can use. If this

[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

2020-06-09 Thread suchendra (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129582#comment-17129582 ] suchendra commented on TIKA-3097: Even for the attached "txt", heap is hitting almost 700

[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

2020-06-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129574#comment-17129574 ] Tim Allison commented on TIKA-3097: I'm not sure I understand the question. We don't h

[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

2020-06-09 Thread suchendra (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129553#comment-17129553 ] suchendra commented on TIKA-3097: [~tallison], Is there any streaming solution for all th

Re: Mime type magic and repeated similar blocks - thoughts?

2020-06-09 Thread Tim Allison
I like the regex option, and I _think_ that the anchor at the beginning (along with the lack of backtracking) shouldn't cause horrible performance degradation. On Tue, Jun 9, 2020 at 7:04 AM Nick Burch wrote: > Hi All > > At the moment, to detect RFC822 emails, we try and check for a bunch of >

Mime type magic and repeated similar blocks - thoughts?

2020-06-09 Thread Nick Burch
Hi All At the moment, to detect RFC822 emails, we try and check for a bunch of common header lines right at the start. If not, we check for a few "could be an unusual header, could be some text", followed by checking for common headers in a larger area of text below. For example, starts with
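
For readers skimming this thread, the "regex option" could look roughly like the sketch below. It is only an illustration: the preview above is cut off before the actual proposal, so the class name, the header list, and the "at least two header lines" threshold are all assumptions, and none of this is Tika's real magic definition. It uses plain java.util.regex, anchors the match at the very start of the buffer, and uses possessive quantifiers so the engine does not backtrack over lines it has already accepted.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class Rfc822Sniffer {

    // Hypothetical list of common header names, for illustration only.
    private static final String HEADER_NAMES =
        "From|To|Cc|Subject|Date|Received|Message-ID|Return-Path|MIME-Version";

    // \A anchors at the start of the input. Each repetition of the group
    // consumes one "Name: value" line. The possessive quantifiers (*+ and
    // {2,}+) never give back what they matched, so a non-matching input
    // fails quickly instead of being retried at shorter lengths.
    private static final Pattern HEADER_BLOCK = Pattern.compile(
        "\\A(?:(?:" + HEADER_NAMES + "):[^\\r\\n]*+\\r?\\n){2,}+",
        Pattern.CASE_INSENSITIVE);

    /** Returns true if the buffer starts with at least two known header lines. */
    public static boolean looksLikeRfc822(byte[] prefix) {
        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot fail.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        return HEADER_BLOCK.matcher(head).lookingAt();
    }
}
```

A caller would pass only the first few kilobytes of a stream, as detectors typically do. Whether a pattern like this is fast enough in practice is exactly the question raised in the reply, so it would need measuring against real files.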