[jira] [Updated] (TIKA-3097) Out of memory while parsing docx
[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

suchendra updated TIKA-3097:
    Attachment: samplefile.txt

> Out of memory while parsing docx
>
>                 Key: TIKA-3097
>                 URL: https://issues.apache.org/jira/browse/TIKA-3097
>             Project: Tika
>          Issue Type: Bug
>          Components: core, parser
>    Affects Versions: 1.24
>            Reporter: suchendra
>            Priority: Major
>         Attachments: Screenshot from 2020-05-07 08-14-25.png, samplefile.txt, test.docx
>
> I have written simple Scala code to extract the content from an uploaded
> docx file. The JVM goes OOM when Tika tries to parse the file. I configured
> the JVM heap to 1 GB and then tried 2 GB; the same issue occurs, both with
> the jar and in my code.
> Attached the file for reference.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3097) Out of memory while parsing docx
[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102248#comment-17102248 ]

suchendra commented on TIKA-3097:

Adding one more file, samplefile.txt. The same OOM issue occurs with it (not every time) when there are multiple hits back to back.
[jira] [Commented] (TIKA-3098) Detecting embedded image
[ https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102246#comment-17102246 ]

suchendra commented on TIKA-3098:

Thank you [~tallison], will look into it.

> Detecting embedded image
>
>                 Key: TIKA-3098
>                 URL: https://issues.apache.org/jira/browse/TIKA-3098
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.24
>            Reporter: suchendra
>            Priority: Minor
>         Attachments: test copy.potx
>
> I am trying to detect embedded images using Apache Tika. I have simple
> Java code that uses EmbeddedDocumentExtractor to detect the embedded image.
> As far as I can see there is no image in the file, but Tika detects an
> embedded image.
> I have attached the file for reference.
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102182#comment-17102182 ]

Hudson commented on TIKA-3094:

SUCCESS: Integrated in Jenkins build Tika-trunk #1813 (See [https://builds.apache.org/job/Tika-trunk/1813/])
TIKA-3094: add javax.xml.bind to system packages. Fix java 11 jaxb. (bob: [https://github.com/apache/tika/commit/1810c5e627664f5b2a6e485ed6bd6d76814dd5f9])
* (edit) tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
* (edit) tika-bundle/test-bundles.xml
* (edit) tika-bundle/pom.xml

> Apache Tika fails to extract text for pptx extension.
>
>                 Key: TIKA-3094
>                 URL: https://issues.apache.org/jira/browse/TIKA-3094
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24, 1.24.1
>            Reporter: Abhishek Chauhan
>            Assignee: Bob Paulin
>            Priority: Critical
>         Attachments: Sample PPT.pptx
>
> This is a regression from Apache Tika 1.23. Text extraction for the .pptx
> extension, which worked with Apache Tika 1.23, no longer works in 1.24.
> For the .ppt extension it works fine in both 1.23 and 1.24.
>
> According to the release notes [https://tika.apache.org/1.24/index.html],
> POI was updated to 4.1.2, which might be the root cause of this problem.
> POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2],
> which I guess is not present in the bundle.
[jira] [Comment Edited] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102137#comment-17102137 ]

Bob Paulin edited comment on TIKA-3094 at 5/8/20, 1:02 AM:

It looks like the JAXB error is not so much an issue with Tika as with the test OSGi container. There are a few different ways to address the jars removed in Java 11, but the simplest, I think, is to add the missing jars to the classpath and expose them to the bundle from the system packages. I do not see the error on Java 11 or 8 now.

was (Author: bob):
It looks like the JAXB error is not so much an issue with Tika as with the test OSGi container. There are a few different ways to address the jars removed in Java 11, but the simplest, I think, is to add the missing jars to the classpath and expose them to the bundle from the system class loader. I do not see the error on Java 11 or 8 now.
[jira] [Comment Edited] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102137#comment-17102137 ]

Bob Paulin edited comment on TIKA-3094 at 5/8/20, 1:02 AM:

It looks like the JAXB error is not so much an issue with Tika as with the test OSGi container. There are a few different ways to address the jars removed in Java 11, but the simplest, I think, is to add the missing jars to the classpath and expose them to the bundle from the system class loader. I do not see the error on Java 11 or 8 now.

was (Author: bob):
It looks like the JAXB error is not so much an issue with Tika as with the test OSGi container. There are a few different ways to address the jars removed in Java 11, but the simplest, I think, is to add the missing jars to the class loader and expose them to the bundle from the system class loader. I do not see the error on Java 11 or 8 now.
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102137#comment-17102137 ]

Bob Paulin commented on TIKA-3094:

It looks like the JAXB error is not so much an issue with Tika as with the test OSGi container. There are a few different ways to address the jars removed in Java 11, but the simplest, I think, is to add the missing jars to the class loader and expose them to the bundle from the system class loader. I do not see the error on Java 11 or 8 now.
[jira] [Commented] (TIKA-3098) Detecting embedded image
[ https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101828#comment-17101828 ]

Tim Allison commented on TIKA-3098:

This is a pretty good example of using the RecursiveParserWrapper: https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaTest.java#L280
[jira] [Comment Edited] (TIKA-3098) Detecting embedded image
[ https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101828#comment-17101828 ]

Tim Allison edited comment on TIKA-3098 at 5/7/20, 4:01 PM:

This is a pretty good example of using the RecursiveParserWrapper programmatically: https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaTest.java#L280

was (Author: talli...@mitre.org):
This is a pretty good example of using the RecursiveParserWrapper: https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaTest.java#L280
[jira] [Commented] (TIKA-3097) Out of memory while parsing docx
[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101826#comment-17101826 ]

Tim Allison commented on TIKA-3097:

Sorry, I commented too soon. After more than a couple of minutes LibreOffice did eventually open this. In short, your best bet is to use the SAX parser for this file. I don't think there's much we can do on the POI side.
[jira] [Comment Edited] (TIKA-3098) Detecting embedded image
[ https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101792#comment-17101792 ]

suchendra edited comment on TIKA-3098 at 5/7/20, 3:43 PM:

Where should I set it?

was (Author: suchendra):
How do I achieve this in the code?
[jira] [Commented] (TIKA-3097) Out of memory while parsing docx
[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101794#comment-17101794 ]

suchendra commented on TIKA-3097:

I also tried opening it in Microsoft Word, and that took more than 2 minutes :(
[jira] [Commented] (TIKA-3098) Detecting embedded image
[ https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101792#comment-17101792 ]

suchendra commented on TIKA-3098:

How do I achieve this in the code?
[jira] [Commented] (TIKA-3097) Out of memory while parsing docx
[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101612#comment-17101612 ]

Tim Allison commented on TIKA-3097:

    java -Xmx128m -jar ~/Downloads/tika-app-1.24.jar --config=tika-config.xml test.docx

works just fine with the SAX parser
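The tika-config.xml passed to the command above is not included in the thread. The following is only a sketch of what such a file might look like for Tika 1.x. It assumes the OOXMLParser parameter names useSAXDocxExtractor and useSAXPptxExtractor, as recalled from the MSOfficeParsers wiki page linked elsewhere in this thread; please verify them against that page before relying on this.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: parameter names are assumptions to be checked against
     the Tika MSOfficeParsers wiki page; this file is not from the thread. -->
<properties>
  <parsers>
    <!-- Keep all default parsers except the OOXML parser... -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
    </parser>
    <!-- ...and register the OOXML parser again with the streaming (SAX) extractors enabled. -->
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="useSAXDocxExtractor" type="bool">true</param>
        <param name="useSAXPptxExtractor" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The idea is that the SAX extractors stream the document XML rather than building the full in-memory object model, which would be consistent with the 128 MB heap in the command above being sufficient.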
[jira] [Commented] (TIKA-3097) Out of memory while parsing docx
[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101608#comment-17101608 ]

Tim Allison commented on TIKA-3097:

LibreOffice doesn't like this file... :(
[jira] [Updated] (TIKA-3097) Out of memory while parsing docx
[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-3097:
    Attachment: Screenshot from 2020-05-07 08-14-25.png
[jira] [Comment Edited] (TIKA-3098) Detecting embedded image
[ https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101588#comment-17101588 ]

Tim Allison edited comment on TIKA-3098 at 5/7/20, 11:55 AM:

There's a thumbnail under docProps. If you use /rmeta or the RecursiveParserWrapper, you'll see this:
"embeddedRelationshipId": "/docProps/thumbnail.jpeg"
We _should_ update the ooxml code to tag this image as type thumbnail: https://tika.apache.org/1.24.1/api/org/apache/tika/metadata/TikaCoreProperties.EmbeddedResourceType.html#THUMBNAIL

was (Author: talli...@mitre.org):
There's a thumbnail under docProps. If you use /rmeta or the RecursiveParserWrapper, you'll see this:
"embeddedRelationshipId": "/docProps/thumbnail.jpeg"
We _should_ update the ooxml code to tag this image as type Inline: https://tika.apache.org/1.24/api/org/apache/tika/metadata/TikaCoreProperties.EmbeddedResourceType.html#INLINE
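One way to confirm, independently of Tika, that the "image" is only the package thumbnail is to list the entries of the file, since OOXML files (docx, pptx, potx) are zip packages. The following is a minimal sketch using only java.util.zip. The class name OoxmlThumbnails and the method thumbnailEntries are made up for illustration and are not Tika APIs. It also assumes that the zip entry name is the part name from the metadata above without the leading slash.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper for illustration; not part of Tika or POI. */
final class OoxmlThumbnails {
    private OoxmlThumbnails() {}

    /** Returns the names of zip entries that look like docProps/thumbnail.* */
    static List<String> thumbnailEntries(InputStream in) throws IOException {
        List<String> found = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Compare case-insensitively; zip entry names carry no leading slash.
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (name.startsWith("docprops/thumbnail.")) {
                    found.add(entry.getName());
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OoxmlThumbnails <file.potx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println("thumbnail entries: " + thumbnailEntries(in));
        }
    }
}
```

If the only image-like entry reported for a file is the thumbnail, the embedded image Tika reports is likely that preview rather than a picture on a slide.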
[jira] [Commented] (TIKA-3098) Detecting embedded image
[ https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101588#comment-17101588 ]

Tim Allison commented on TIKA-3098:

There's a thumbnail under docProps. If you use /rmeta or the RecursiveParserWrapper, you'll see this:
"embeddedRelationshipId": "/docProps/thumbnail.jpeg"
We _should_ update the ooxml code to tag this image as type Inline: https://tika.apache.org/1.24/api/org/apache/tika/metadata/TikaCoreProperties.EmbeddedResourceType.html#INLINE
[jira] [Commented] (TIKA-3097) Out of memory while parsing docx
[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101584#comment-17101584 ]

Tim Allison commented on TIKA-3097:

Uncompressed, you're looking at ~150 MB for the file. XMLBeans on top of that adds quite a bit of overhead... 2 GB sounds excessive.

There is a streaming option for docx and pptx: https://cwiki.apache.org/confluence/display/TIKA/MSOfficeParsers

I'll take a look in the debugger later today and let you know if this is a bug or feature.
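The uncompressed figure mentioned above can be checked for any docx without Tika or POI, because a docx is a zip package. The following is a minimal sketch using only java.util.zip. The class name OoxmlSize and the method uncompressedBytes are made up for illustration and are not Tika APIs. It inflates every entry and counts the bytes rather than trusting the sizes recorded in the zip headers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipInputStream;

/** Hypothetical helper for illustration; not part of Tika or POI. */
final class OoxmlSize {
    private OoxmlSize() {}

    /** Returns the total number of bytes across all entries once inflated. */
    static long uncompressedBytes(InputStream in) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(in)) {
            while (zip.getNextEntry() != null) {
                int n;
                // read() returns -1 at the end of the current entry.
                while ((n = zip.read(buf)) != -1) {
                    total += n;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OoxmlSize <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(uncompressedBytes(in) + " bytes uncompressed");
        }
    }
}
```

A large gap between the size on disk and this number is a hint that a DOM-style parse will need far more heap than the file size suggests, which is where the streaming option is worth trying.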
[jira] [Closed] (TIKA-3096) detect image in any document
[ https://issues.apache.org/jira/browse/TIKA-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

suchendra closed TIKA-3096.
    Resolution: Invalid

> detect image in any document
>
>                 Key: TIKA-3096
>                 URL: https://issues.apache.org/jira/browse/TIKA-3096
>             Project: Tika
>          Issue Type: Bug
>          Components: documentation, example, parser
>    Affects Versions: 1.23
>            Reporter: suchendra
>            Priority: Minor
>
> How do I detect whether a document contains an image or not?
> val parser = new AutoDetectParser()
> val handler = new ToXMLContentHandler()
> parser.parse(tikaIs, handler, new Metadata, new ParseContext)
> println("File Content:" + handler.toString)
>
> I tried using HTMLHandler and, based on the existence of an img tag,
> considered the file to contain an image. Is there any better way to achieve this?
[jira] [Created] (TIKA-3098) Detecting embedded image
suchendra created TIKA-3098:

             Summary: Detecting embedded image
                 Key: TIKA-3098
                 URL: https://issues.apache.org/jira/browse/TIKA-3098
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.24
            Reporter: suchendra
         Attachments: test copy.potx

I am trying to detect embedded images using Apache Tika. I have simple Java code that uses EmbeddedDocumentExtractor to detect the embedded image.
As far as I can see there is no image in the file, but Tika detects an embedded image.
I have attached the file for reference.
[jira] [Created] (TIKA-3097) Out of memory while parsing docx
suchendra created TIKA-3097:

             Summary: Out of memory while parsing docx
                 Key: TIKA-3097
                 URL: https://issues.apache.org/jira/browse/TIKA-3097
             Project: Tika
          Issue Type: Bug
          Components: core, parser
    Affects Versions: 1.24
            Reporter: suchendra
         Attachments: test.docx

I have written simple Scala code to extract the content from an uploaded docx file. The JVM goes OOM when Tika tries to parse the file. I configured the JVM heap to 1 GB and then tried 2 GB; the same issue occurs, both with the jar and in my code.
Attached the file for reference.