[jira] [Commented] (TIKA-636) Taking very high heap space while parsing docx - Resulting in OOM in tha app
[ https://issues.apache.org/jira/browse/TIKA-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121157#comment-13121157 ]

Jukka Zitting commented on TIKA-636:

Do you still see this problem with Tika 0.10? If yes, please attach an example file that can be used to reproduce the issue.

> Taking very high heap space while parsing docx - Resulting in OOM in tha app
>
> Key: TIKA-636
> URL: https://issues.apache.org/jira/browse/TIKA-636
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.9
> Environment: Linux box
> JDK 1.6
> Reporter: Jayesh K Rajpurohit
>
> I am using Tika-core-0.9 jar and poi 3.2-Final jar and poi-3.7 jars for
> parsing the documents. But while parsing 3MB docx it is using 500 MB of RAM
> space which is too high resulting in OOM in the application.
> Do I have to tweak in at some place for reducing down the memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-636) Taking very high heap space while parsing docx - Resulting in OOM in tha app
[ https://issues.apache.org/jira/browse/TIKA-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080988#comment-13080988 ]

Jukka Zitting commented on TIKA-636:

As a related point, see TIKA-416 for a solution that can be used to prevent an OOM caused by a parsing process from wreaking havoc in your JVM. Instead of reducing memory consumption, TIKA-416 sandboxes the parser to a separate JVM process where it can safely fail with OOM or other errors.
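To illustrate the separate-process idea in the comment above, here is a minimal sketch using only the JDK. It is not the TIKA-416 code. The class name ChildJvmSketch, the main class example.ParseOne, and the jar and file names are hypothetical, and the sketch assumes Java 8 or later (newer than the JDK 1.6 in the report). The child JVM gets its own -Xmx cap, so an OutOfMemoryError there ends only the child process.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs one parse job in a child JVM with its own heap limit.
class ChildJvmSketch {

    // Builds the command line for the child JVM, reusing the java binary
    // of the JVM that is currently running.
    static List<String> command(String maxHeap, String classpath,
                                String mainClass, String... args) {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-Xmx" + maxHeap);   // heap cap applies to the child only
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(mainClass);
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    // Starts the child, sends its stdout and stderr to a file, and waits.
    // Returns true only if the child exited with status 0 within the timeout.
    static boolean run(List<String> cmd, File output, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process child = new ProcessBuilder(cmd)
                .redirectErrorStream(true)
                .redirectOutput(output)
                .start();
        if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            child.destroyForcibly();  // hung parse: kill the child, parent survives
            return false;
        }
        return child.exitValue() == 0;
    }
}
```

A caller might use ChildJvmSketch.run(ChildJvmSketch.command("256m", "app.jar", "example.ParseOne", "in.docx"), new File("out.txt"), 60) and treat a false result as a failed parse rather than a crash of the host application.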
[jira] [Commented] (TIKA-636) Taking very high heap space while parsing docx - Resulting in OOM in tha app
[ https://issues.apache.org/jira/browse/TIKA-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080955#comment-13080955 ]

Nick Burch commented on TIKA-636:

Please remember: Tika is a volunteer project. If this bug matters to you, please help us work on it.

As Maxim has pointed out, we'd need an event-based parser for DOCX files, much as we already have for XLSX. The existing POI usermodel code could likely be used for the other streams to make life easy, but the document.xml part will want to be SAX parsed.
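For readers unfamiliar with the event-based approach mentioned above, the following is a rough sketch of SAX-parsing word/document.xml straight out of the docx zip container, using only the JDK. It is not POI or Tika code, and the class name DocxSaxSketch is made up for this example. It assumes the body text sits in w:t elements inside w:p paragraphs, and it ignores headers, footers, tables, and every other part of the package.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams word/document.xml through SAX instead of
// building an object tree for the whole document.
class DocxSaxSketch {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static String extractText(InputStream docx)
            throws IOException, SAXException, ParserConfigurationException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        final StringBuilder out = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            private boolean inText;  // true while inside a w:t element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (W_NS.equals(uri) && "t".equals(local)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(local)) {
                    inText = false;
                } else if ("p".equals(local)) {
                    out.append('\n');  // one output line per paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        };

        // A docx file is a zip archive; scan entries until the main document part.
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    factory.newSAXParser().parse(zip, handler);
                    break;
                }
            }
        }
        return out.toString();
    }
}
```

The sketch collects the text in a StringBuilder for simplicity, so memory use still grows with the amount of extracted text. A parser that forwards each characters() call to a downstream handler would avoid holding even that.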
[jira] [Commented] (TIKA-636) Taking very high heap space while parsing docx - Resulting in OOM in tha app
[ https://issues.apache.org/jira/browse/TIKA-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080944#comment-13080944 ]

Nicholas Dodd commented on TIKA-636:

I am really surprised this is not scheduled for the 1.0 release. We also are seeing 500MB RAM usage for small docx files - this is simply not a shippable bug!
[jira] [Commented] (TIKA-636) Taking very high heap space while parsing docx - Resulting in OOM in tha app
[ https://issues.apache.org/jira/browse/TIKA-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017322#comment-13017322 ]

Jayesh K Rajpurohit commented on TIKA-636:

To clarify, what I meant is the fix for the OOM issue as part of a Tika release.

Thanks
[jira] [Commented] (TIKA-636) Taking very high heap space while parsing docx - Resulting in OOM in tha app
[ https://issues.apache.org/jira/browse/TIKA-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017319#comment-13017319 ]

Jayesh K Rajpurohit commented on TIKA-636:

Thanks Maxim,

Yes, the number of xmlbeans objects is taking its toll. I tried parsing word/document.xml with local SAX parser code. It produced a String of size 3 MB for a 3 MB docx (it looks like some data was repeated), whereas Tika produces only 100 KB for the same file. But the native code took only 3 MB.

So when can we expect this as part of a Tika release?

Thanks!
[jira] [Commented] (TIKA-636) Taking very high heap space while parsing docx - Resulting in OOM in tha app
[ https://issues.apache.org/jira/browse/TIKA-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017311#comment-13017311 ]

Maxim Valyanskiy commented on TIKA-636:

This is a known problem in POI. AFAIK there is no event model for parsing docx, so we had to build the complete object (xmlbean) tree to process it.