[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Abhijit Rajwade (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17095133#comment-17095133
 ] 

Abhijit Rajwade commented on TIKA-3094:
---

I am working with [~abchauha] on this issue.

One question.
I do not see any reference to SparseBitSet in the Tika 1.24 sources.
Is it required because Tika 1.24 uses POI 4.1.2, and POI added a dependency on 
SparseBitSet 1.2?

> Apache Tika fails to extract text for pptx extension.
> -----------------------------------------------------
>
>                 Key: TIKA-3094
>                 URL: https://issues.apache.org/jira/browse/TIKA-3094
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24
>            Reporter: Abhishek Chauhan
>            Priority: Major
>         Attachments: Sample PPT.pptx
>
> This is a regression from Apache Tika 1.23. Text extraction for .pptx 
> extensions, which was working earlier with Apache Tika 1.23, is no longer 
> working in version 1.24.
> For the .ppt extension it is working fine in both 1.23 and 1.24.
>
> According to the release notes [https://tika.apache.org/1.24/index.html], you 
> have updated POI to 4.1.2. That might be the root cause of this problem. 
> POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2], 
> which I guess is not present in the bundle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094735#comment-17094735
 ] 

Tim Allison commented on TIKA-3094:
---

Thank you, [~bob]!



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Abhishek Chauhan (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094638#comment-17094638
 ] 

Abhishek Chauhan commented on TIKA-3094:


Glad to hear it! Thanks for sharing this, [~bob].



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094614#comment-17094614
 ] 

Bob Paulin commented on TIKA-3094:
--

Thanks [~abchauha] .  The build process adds OSGi specific headers so I'm not 
surprised the approach you describe above was unsuccessful.  The 
maven-bundle-plugin will ensure it is embedded properly.



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Abhishek Chauhan (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094594#comment-17094594
 ] 

Abhishek Chauhan commented on TIKA-3094:


[~bob] Please find the .pptx file attached. 

I would just like to add that I opened the Tika bundle with 7zip and copied 
the SparseBitSet 1.2 jar into it, but that did not resolve the issue for me.



[jira] [Updated] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Abhishek Chauhan (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Chauhan updated TIKA-3094:
---
Attachment: Sample PPT.pptx



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094517#comment-17094517
 ] 

Bob Paulin commented on TIKA-3094:
--

If SparseBitSet is embedded in the tika-bundle, then the library itself doesn't 
need the OSGi headers, as tika-bundle will provide the headers.  It looks like 
it's Apache Licensed, so we should be able to do this.  Can you provide a .pptx 
that is failing?  It would be great if we could integrate it into the test suite.



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094466#comment-17094466
 ] 

Tim Allison commented on TIKA-3094:
---

[~bobpaulin], is this something we can fix within Tika or do we need to open an 
issue on com.zaxxer?

I tried a couple of combinations of things in the tika-bundle pom.xml and 
couldn't get anything to work.  The difference is that you know what you're 
doing. :D



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Abhishek Chauhan (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094407#comment-17094407
 ] 

Abhishek Chauhan commented on TIKA-3094:


[~tallison] We are calling Tika through the OSGi bundle.

Also, I noticed that 
[https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2] is not 
OSGi-enabled (the manifest inside the jar is blank).
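
Whether a jar carries OSGi metadata can also be checked programmatically. The following is a minimal sketch, not part of Tika: the class name OsgiManifestCheck and its helper methods are made up for illustration. It assumes that the presence of the Bundle-SymbolicName manifest header is what marks a jar as an OSGi bundle, and it uses only java.util.jar from the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper for illustration only; not a Tika or POI class.
public class OsgiManifestCheck {

    /** True if the manifest has the Bundle-SymbolicName header of an OSGi bundle. */
    static boolean hasOsgiHeaders(Manifest manifest) {
        if (manifest == null) {
            return false; // a jar without any manifest cannot be a bundle
        }
        return manifest.getMainAttributes().getValue("Bundle-SymbolicName") != null;
    }

    /** Reads the manifest of a jar on disk and applies the same check. */
    static boolean jarHasOsgiHeaders(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return hasOsgiHeaders(jar.getManifest());
        }
    }

    /** Parses manifest text, so the example runs without any jar file. */
    static Manifest parse(String text) {
        try {
            return new Manifest(
                    new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Two in-memory manifests: one plain, one with an OSGi header.
        Manifest plain = parse("Manifest-Version: 1.0\n\n");
        Manifest bundle = parse(
                "Manifest-Version: 1.0\nBundle-SymbolicName: example.bundle\n\n");
        System.out.println("plain manifest is a bundle:  " + hasOsgiHeaders(plain));
        System.out.println("bundle manifest is a bundle: " + hasOsgiHeaders(bundle));

        // Any jar paths given on the command line are checked as well.
        for (String path : args) {
            System.out.println(path + " -> " + jarHasOsgiHeaders(path));
        }
    }
}
```

Passing the path of a downloaded jar as an argument reports whether that jar declares itself as a bundle; a jar with a blank manifest would be reported as not being one.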



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094385#comment-17094385
 ] 

Tim Allison commented on TIKA-3094:
---

How are you calling Tika? Are you using the osgi bundle or calling it directly?



Re: Issue with > 200% CPU after bulk usage

2020-04-28 Thread hans.meijer
Hi,
I ran into the issue again with Tika/Java taking more CPU, over 200% CPU.
 
The scenario is that I have 3-4 long-running processes calling the Tika server
(version 1.24), and occasionally 3-4 additional shorter processes (2-3 hours)
start up and call the Tika server.
The scenario has been running for a couple of days, extracting text from various
types of documents.

The Tika server is running locally.

 
Top shows this:


--
top - 16:21:17 up 5 days,  8:12,  6 users,  load average: 2,64, 2,63, 2,61
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu(s): 50,8 us,  0,3 sy,  0,0 ni, 48,8 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem :  4032128 total,   129052 free,  2702236 used,  1200840 buff/cache
KiB Swap:  4192252 total,  2968864 free,  1223388 used.  1040340 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  911 root      20   0 4578604 1,229g   8024 S 204,3 32,0 859:11.22 java
  743 root      20   0  196596   5772    920 S   0,7  0,1  35:28.02 wizit_rest
34637 elastic+  20   0 21,346g 883808  30616 S   0,3 21,9   1250:04 java
    1 root      20   0  204620   3440   2376 S   0,0  0,1   0:14.99 systemd
    2 root      20   0       0      0      0 S   0,0  0,0   0:00.15 kthreadd
    3 root      20   0       0      0      0 S   0,0  0,0   1:46.20 ksoftirqd+
    5 root       0 -20       0      0      0 S   0,0  0,0   0:00.00 kworker/0+
    7 root      20   0       0      0      0 S   0,0  0,0   4:59.14 rcu_sched
    8 root      20   0       0      0      0 S   0,0  0,0   0:00.00 rcu_bh
    9 root      rt   0       0      0      0 S   0,0  0,0   0:03.83 migration+

--


At first I ran jstackseries.sh:

--
more jstack.911.202904.163848252
Attaching to process ID 911, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.242-b08
Deadlock Detection:

Can't print deadlocks:Unable to deduce type of thread from address 0x7f30bc02d800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)

--

It also froze the system: "systemd[1]: Freezing execution."


But I finally got a thread dump via jstack, which I have attached. I have also
attached the tika-config file in case it is useful.
I hope this helps to analyze the issue.
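
When jstack cannot attach cleanly, per-thread CPU time can also be read from inside a JVM through the standard java.lang.management API. The following is a minimal sketch under stated assumptions, not something the Tika server provides: the class name BusyThreads and the method busiestThreads are made up for illustration, and the code only inspects the JVM it runs in, so inspecting a running server would additionally need a JMX connection or code running in that process.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not a Tika class.
public class BusyThreads {

    /** One line per live thread (CPU ms, name, top stack frame), busiest first. */
    static List<String> busiestThreads(int limit) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = mx.isThreadCpuTimeSupported();

        // Collect {threadId, cpuNanos}; -1 means the CPU time is unavailable.
        List<long[]> cpuById = new ArrayList<>();
        for (long id : mx.getAllThreadIds()) {
            long nanos = cpuSupported ? mx.getThreadCpuTime(id) : -1;
            cpuById.add(new long[] {id, nanos});
        }
        cpuById.sort(Comparator.comparingLong((long[] e) -> e[1]).reversed());

        List<String> lines = new ArrayList<>();
        for (long[] entry : cpuById) {
            if (lines.size() >= limit) {
                break;
            }
            ThreadInfo info = mx.getThreadInfo(entry[0], 1); // top frame only
            if (info == null) {
                continue; // the thread ended after the ids were listed
            }
            StackTraceElement[] stack = info.getStackTrace();
            String top = stack.length > 0 ? stack[0].toString() : "(no Java frames)";
            lines.add(String.format("%8d ms  %-30s %s",
                    entry[1] / 1_000_000, info.getThreadName(), top));
        }
        return lines;
    }

    public static void main(String[] args) {
        busiestThreads(10).forEach(System.out::println);
    }
}
```

Sorting by accumulated CPU time rather than by thread state is a deliberate choice here: a thread that is briefly idle at the moment of sampling can still be the one that consumed most of the CPU.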


Kind regards 
Hans


-----Original Message-----
From: Nick Burch  
Sent: 16 April 2020 15:40
To: hans.mei...@avident-it.se
Cc: dev@tika.apache.org
Subject: Re: Issue with > 200% CPU after bulk usage

On Wed, 15 Apr 2020, hans.mei...@avident-it.se wrote:
> I have encountered an issue with Tika running locally on a box that 
> the Java runtime goes up to over 200% CPU, after running a bulk load 
> of documents over a couple of days, it is more than 3 million documents.

Can you do a thread dump to show what the JVM is doing?
https://access.redhat.com/solutions/18178

Nick


[jira] [Updated] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Abhishek Chauhan (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Chauhan updated TIKA-3094:
---
Description: 
This is a regression from Apache Tika 1.23. Text extraction for .pptx 
extensions, which was working earlier with Apache Tika 1.23, is no longer working 
in version 1.24.

For the .ppt extension it is working fine in both 1.23 and 1.24.

 

According to the release notes [https://tika.apache.org/1.24/index.html], you 
have updated POI to 4.1.2. That might be the root cause of this problem. 
POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2], 
which I guess is not present in the bundle.

 

 

  was:
This is a regression from Apache Tika 1.23.

For the .ppt extension it is working fine in both 1.23 and 1.24.

Text extraction for .pptx extensions, which was working earlier with Apache Tika, 
is no longer working in version 1.24.




[jira] [Created] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-04-28 Thread Abhishek Chauhan (Jira)
Abhishek Chauhan created TIKA-3094:
--

 Summary: Apache Tika fails to extract text for pptx extension.
 Key: TIKA-3094
 URL: https://issues.apache.org/jira/browse/TIKA-3094
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.24
Reporter: Abhishek Chauhan


This is a regression from Apache Tika 1.23.

For the .ppt extension it is working fine in both 1.23 and 1.24.

Text extraction for .pptx extensions, which was working earlier with Apache Tika, 
is no longer working in version 1.24.


