[jira] [Updated] (TIKA-3097) Out of memory while parsing docx

2020-05-07 Thread suchendra (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

suchendra updated TIKA-3097:

Attachment: samplefile.txt

> Out of memory while parsing docx
> 
>
> Key: TIKA-3097
> URL: https://issues.apache.org/jira/browse/TIKA-3097
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.24
>Reporter: suchendra
>Priority: Major
> Attachments: Screenshot from 2020-05-07 08-14-25.png, samplefile.txt, 
> test.docx
>
>
> I have written simple Scala code to extract the content of an uploaded docx
> file. The JVM runs out of memory when Tika tries to parse the file. I
> configured the JVM heap to 1 GB and then tried 2 GB; the same issue occurs
> both with the jar and in my code.
> I have attached the file for reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

2020-05-07 Thread suchendra (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102248#comment-17102248
 ] 

suchendra commented on TIKA-3097:
-

Adding one more file, samplefile.txt. It hits the same OOM (not every time)
when there are multiple back-to-back requests with the attached file.



[jira] [Commented] (TIKA-3098) Detecting embedded image

2020-05-07 Thread suchendra (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102246#comment-17102246
 ] 

suchendra commented on TIKA-3098:
-

Thank you [~tallison], will look into it.

> Detecting embedded image
> 
>
> Key: TIKA-3098
> URL: https://issues.apache.org/jira/browse/TIKA-3098
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.24
>Reporter: suchendra
>Priority: Minor
> Attachments: test copy.potx
>
>
> I am trying to detect embedded images using Apache Tika. I have simple Java
> code that uses EmbeddedDocumentExtractor to detect them.
> As far as I can see the file contains no image, but Tika detects an embedded
> image.
> I have attached the file for reference.





[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-07 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102182#comment-17102182
 ] 

Hudson commented on TIKA-3094:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1813 (See 
[https://builds.apache.org/job/Tika-trunk/1813/])
TIKA-3094: add javax.xml.bind to system packages.  Fix java 11 jaxb. (bob: 
[https://github.com/apache/tika/commit/1810c5e627664f5b2a6e485ed6bd6d76814dd5f9])
* (edit) tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
* (edit) tika-bundle/test-bundles.xml
* (edit) tika-bundle/pom.xml


> Apache Tika fails to extract text for pptx extension.
> -
>
> Key: TIKA-3094
> URL: https://issues.apache.org/jira/browse/TIKA-3094
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.24, 1.24.1
>Reporter: Abhishek Chauhan
>Assignee: Bob Paulin
>Priority: Critical
> Attachments: Sample PPT.pptx
>
>
> This is a regression from Apache Tika 1.23. Text extraction for the .pptx
> extension, which worked with Apache Tika 1.23, no longer works in version
> 1.24.
> For the .ppt extension it works fine in both 1.23 and 1.24.
>  
> According to the release notes [https://tika.apache.org/1.24/index.html], you
> have updated POI to 4.1.2. That might be the root cause of this problem.
> POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2],
> which I guess is not present in the bundle.
>  
>  





[jira] [Comment Edited] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-07 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102137#comment-17102137
 ] 

Bob Paulin edited comment on TIKA-3094 at 5/8/20, 1:02 AM:
---

Looks like the JAXB error is not so much an issue with Tika as it is with the
test OSGi container.  There are a few different ways to address the jars
removed in Java 11, but the simplest, I think, is to just add the missing jars
to the classpath and expose them to the bundle from the system packages.  I do
not see the error on Java 11 or 8 now.


was (Author: bob):
Looks like the JAXB error is not so much an issue with Tika as it is with the
test OSGi container.  There are a few different ways to address the jars
removed in Java 11, but the simplest, I think, is to just add the missing jars
to the classpath and expose them to the bundle from the system class loader.  I
do not see the error on Java 11 or 8 now.



[jira] [Comment Edited] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-07 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102137#comment-17102137
 ] 

Bob Paulin edited comment on TIKA-3094 at 5/8/20, 1:02 AM:
---

Looks like the JAXB error is not so much an issue with Tika as it is with the
test OSGi container.  There are a few different ways to address the jars
removed in Java 11, but the simplest, I think, is to just add the missing jars
to the classpath and expose them to the bundle from the system class loader.  I
do not see the error on Java 11 or 8 now.


was (Author: bob):
Looks like the JAXB error is not so much an issue with Tika as it is with the
test OSGi container.  There are a few different ways to address the jars
removed in Java 11, but the simplest, I think, is to just add the missing jars
to the class loader and expose them to the bundle from the system class loader.
I do not see the error on Java 11 or 8 now.



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-07 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102137#comment-17102137
 ] 

Bob Paulin commented on TIKA-3094:
--

Looks like the JAXB error is not so much an issue with Tika as it is with the
test OSGi container.  There are a few different ways to address the jars
removed in Java 11, but the simplest, I think, is to just add the missing jars
to the class loader and expose them to the bundle from the system class loader.
I do not see the error on Java 11 or 8 now.



[jira] [Commented] (TIKA-3098) Detecting embedded image

2020-05-07 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101828#comment-17101828
 ] 

Tim Allison commented on TIKA-3098:
---

This is a pretty good example of using the RecursiveParserWrapper: 
https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaTest.java#L280



[jira] [Comment Edited] (TIKA-3098) Detecting embedded image

2020-05-07 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101828#comment-17101828
 ] 

Tim Allison edited comment on TIKA-3098 at 5/7/20, 4:01 PM:


This is a pretty good example of using the RecursiveParserWrapper 
programmatically: 
https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaTest.java#L280


was (Author: talli...@mitre.org):
This is a pretty good example of using the RecursiveParserWrapper: 
https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaTest.java#L280



[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

2020-05-07 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101826#comment-17101826
 ] 

Tim Allison commented on TIKA-3097:
---

Sorry, I commented too soon.  After more than a couple of minutes LibreOffice 
did eventually open this.  In short, your best bet is to use the SAX parser for 
this file.  I don't think there's much we can do on the POI side.

> Out of memory while parsing docx
> 
>
> Key: TIKA-3097
> URL: https://issues.apache.org/jira/browse/TIKA-3097
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.24
>Reporter: suchendra
>Priority: Major
> Attachments: Screenshot from 2020-05-07 08-14-25.png, test.docx
>
>
> I have written simple Scala code to extract the content of an uploaded docx
> file. The JVM runs out of memory when Tika tries to parse the file. I
> configured the JVM heap to 1 GB and then tried 2 GB; the same issue occurs
> both with the jar and in my code.
> I have attached the file for reference.





[jira] [Comment Edited] (TIKA-3098) Detecting embedded image

2020-05-07 Thread suchendra (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101792#comment-17101792
 ] 

suchendra edited comment on TIKA-3098 at 5/7/20, 3:43 PM:
--

Where should I set it?


was (Author: suchendra):
How do I achieve this in the code?



[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

2020-05-07 Thread suchendra (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101794#comment-17101794
 ] 

suchendra commented on TIKA-3097:
-

I also tried opening it in Microsoft Word; that took more than 2 minutes :(

 



[jira] [Commented] (TIKA-3098) Detecting embedded image

2020-05-07 Thread suchendra (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101792#comment-17101792
 ] 

suchendra commented on TIKA-3098:
-

How do I achieve this in the code?



[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

2020-05-07 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101612#comment-17101612
 ] 

Tim Allison commented on TIKA-3097:
---

java -Xmx128m -jar ~/Downloads/tika-app-1.24.jar --config=tika-config.xml test.docx

works just fine with the SAX parser
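
For readers who want to reproduce this, a tika-config.xml that switches the OOXML parser to its SAX-based extractors might look like the sketch below. It is an assumption modeled on the MSOfficeParsers wiki page linked elsewhere in this issue, not the exact file used for the command above; the parameter names useSAXDocxExtractor and useSAXPptxExtractor should be checked against the Tika version in use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumed layout, verify against the MSOfficeParsers wiki page. -->
<properties>
  <parsers>
    <!-- Keep the default parsers, but take OOXML handling out of them... -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
    </parser>
    <!-- ...and register the OOXML parser separately with the streaming (SAX) extractors enabled. -->
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="useSAXDocxExtractor" type="bool">true</param>
        <param name="useSAXPptxExtractor" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The file is passed to tika-app with --config=tika-config.xml, as in the command above.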



[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

2020-05-07 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101608#comment-17101608
 ] 

Tim Allison commented on TIKA-3097:
---

LibreOffice doesn't like this file... :(



[jira] [Updated] (TIKA-3097) Out of memory while parsing docx

2020-05-07 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3097:
--
Attachment: Screenshot from 2020-05-07 08-14-25.png



[jira] [Comment Edited] (TIKA-3098) Detecting embedded image

2020-05-07 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101588#comment-17101588
 ] 

Tim Allison edited comment on TIKA-3098 at 5/7/20, 11:55 AM:
-

There's a thumbnail under docProps.  If you use /rmeta or the 
RecursiveParserWrapper, you'll see this:

"embeddedRelationshipId": "/docProps/thumbnail.jpeg"

We _should_ update the ooxml code to tag this image as type thumbnail: 
https://tika.apache.org/1.24.1/api/org/apache/tika/metadata/TikaCoreProperties.EmbeddedResourceType.html#THUMBNAIL
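
Because a .potx file is a zip package, the thumbnail can also be confirmed without Tika. The following is a minimal sketch using only the JDK; the class name ThumbnailCheck and the method findThumbnails are hypothetical names chosen for illustration. Zip entry names carry no leading slash, so the part that Tika reports as "/docProps/thumbnail.jpeg" is matched here as "docProps/thumbnail...".

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ThumbnailCheck {

    /** Returns the names of all zip entries whose name starts with "docProps/thumbnail". */
    public static List<String> findThumbnails(String path) throws IOException {
        List<String> found = new ArrayList<>();
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                // Matches thumbnail.jpeg as well as other thumbnail formats stored in docProps.
                if (name.startsWith("docProps/thumbnail")) {
                    found.add(name);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ThumbnailCheck <file.potx>");
            return;
        }
        System.out.println("Thumbnail parts: " + findThumbnails(args[0]));
    }
}
```

A non-empty result for the attached file would explain why an embedded image is reported even though no picture is visible on the slides.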


was (Author: talli...@mitre.org):
There's a thumbnail under docProps.  If you use /rmeta or the 
RecursiveParserWrapper, you'll see this:

"embeddedRelationshipId": "/docProps/thumbnail.jpeg"

We _should_ update the ooxml code to tag this image as type Inline: 
https://tika.apache.org/1.24/api/org/apache/tika/metadata/TikaCoreProperties.EmbeddedResourceType.html#INLINE



[jira] [Commented] (TIKA-3098) Detecting embedded image

2020-05-07 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101588#comment-17101588
 ] 

Tim Allison commented on TIKA-3098:
---

There's a thumbnail under docProps.  If you use /rmeta or the 
RecursiveParserWrapper, you'll see this:

"embeddedRelationshipId": "/docProps/thumbnail.jpeg"

We _should_ update the ooxml code to tag this image as type Inline: 
https://tika.apache.org/1.24/api/org/apache/tika/metadata/TikaCoreProperties.EmbeddedResourceType.html#INLINE



[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

2020-05-07 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101584#comment-17101584
 ] 

Tim Allison commented on TIKA-3097:
---

Uncompressed, you're looking at ~150MB for the file.  XMLBeans on top of that
adds quite a bit of overhead, but 2 GB sounds excessive.  There is a streaming
option for docx and pptx:
https://cwiki.apache.org/confluence/display/TIKA/MSOfficeParsers

I'll take a look in the debugger later today and let you know if this is a bug 
or feature.
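
The uncompressed size of a docx can be checked before parsing, since the file is a zip package. Below is a minimal sketch using only the JDK; the class name DocxSize and the method uncompressedBytes are hypothetical names for illustration, and the total relies on the sizes recorded in the zip directory.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DocxSize {

    /** Sums the uncompressed sizes recorded for every entry of a zip-based file such as a .docx. */
    public static long uncompressedBytes(String path) throws IOException {
        long total = 0;
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                long size = entries.nextElement().getSize();
                // getSize() returns -1 when the uncompressed size is not known; skip those.
                if (size > 0) {
                    total += size;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java DocxSize <file.docx>");
            return;
        }
        double megabytes = uncompressedBytes(args[0]) / (1024.0 * 1024.0);
        System.out.printf("%s: %.1f MB uncompressed%n", args[0], megabytes);
    }
}
```

A check like this could be used to route unusually large documents to the streaming parser or to reject them before they reach the default DOM-based path.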

> Out of memory while parsing docx
> 
>
> Key: TIKA-3097
> URL: https://issues.apache.org/jira/browse/TIKA-3097
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.24
>Reporter: suchendra
>Priority: Major
> Attachments: test.docx
>
>
> I have written simple Scala code to extract the content of an uploaded docx
> file. The JVM runs out of memory when Tika tries to parse the file. I
> configured the JVM heap to 1 GB and then tried 2 GB; the same issue occurs
> both with the jar and in my code.
> I have attached the file for reference.





[jira] [Closed] (TIKA-3096) detect image in any document

2020-05-07 Thread suchendra (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

suchendra closed TIKA-3096.
---
Resolution: Invalid

> detect image in any document
> 
>
> Key: TIKA-3096
> URL: https://issues.apache.org/jira/browse/TIKA-3096
> Project: Tika
>  Issue Type: Bug
>  Components: documentation, example, parser
>Affects Versions: 1.23
>Reporter: suchendra
>Priority: Minor
>
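> How do I detect whether a document contains an image or not?
> import org.apache.tika.metadata.Metadata
> import org.apache.tika.parser.{AutoDetectParser, ParseContext}
> import org.apache.tika.sax.ToXMLContentHandler
> // tikaIs is the java.io.InputStream of the uploaded file (defined elsewhere).
> val parser = new AutoDetectParser()
> val handler = new ToXMLContentHandler()
> parser.parse(tikaIs, handler, new Metadata, new ParseContext)
> println("File Content:" + handler.toString)
>  
> I tried using HTMLHandler and, based on the existence of an img tag, treated
> the file as containing an image. Is there any better way to achieve this?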





[jira] [Created] (TIKA-3098) Detecting embedded image

2020-05-07 Thread suchendra (Jira)
suchendra created TIKA-3098:
---

 Summary: Detecting embedded image
 Key: TIKA-3098
 URL: https://issues.apache.org/jira/browse/TIKA-3098
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.24
Reporter: suchendra
 Attachments: test copy.potx

I am trying to detect embedded images using Apache Tika. I have simple Java
code that uses EmbeddedDocumentExtractor to detect them.
As far as I can see the file contains no image, but Tika detects an embedded
image.

I have attached the file for reference.





[jira] [Created] (TIKA-3097) Out of memory while parsing docx

2020-05-07 Thread suchendra (Jira)
suchendra created TIKA-3097:
---

 Summary: Out of memory while parsing docx
 Key: TIKA-3097
 URL: https://issues.apache.org/jira/browse/TIKA-3097
 Project: Tika
  Issue Type: Bug
  Components: core, parser
Affects Versions: 1.24
Reporter: suchendra
 Attachments: test.docx

I have written simple Scala code to extract the content of an uploaded docx
file. The JVM runs out of memory when Tika tries to parse the file. I
configured the JVM heap to 1 GB and then tried 2 GB; the same issue occurs
both with the jar and in my code.
I have attached the file for reference.


