[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17095133#comment-17095133 ]

Abhijit Rajwade commented on TIKA-3094:
---
I am working with [~abchauha] on this issue. One question: I do not see any reference to SparseBitSet in the Tika 1.24 sources. Is it required because Tika 1.24 uses POI 4.1.2, and POI added a dependency on SparseBitSet 1.2?

> Apache Tika fails to extract text for pptx extension.
>
> Key: TIKA-3094
> URL: https://issues.apache.org/jira/browse/TIKA-3094
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.24
> Reporter: Abhishek Chauhan
> Priority: Major
> Attachments: Sample PPT.pptx
>
> This is a regression from Apache Tika 1.23. Text extraction for .pptx
> files, which worked with Apache Tika 1.23, no longer works in 1.24.
> For .ppt files it works fine in both 1.23 and 1.24.
>
> According to the release notes [https://tika.apache.org/1.24/index.html],
> POI was updated to 4.1.2. That might be the root cause of this problem.
> POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2],
> which I guess is not present in the bundle.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
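One way to look at the question above from the runtime side is to check whether the class is visible to the class loader that loads POI. What follows is only a minimal sketch. It assumes the artifact's main class is `com.zaxxer.sparsebits.SparseBitSet` (that name is not stated anywhere in this thread), and the helper `ClassAvailability` is a made-up name for illustration.

```java
/** Hypothetical helper: checks whether a class can be resolved by a given class loader. */
final class ClassAvailability {

    /** Returns true if the named class can be located, without running its static initializers. */
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not visible to this loader, or visible but not linkable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class name; pass another one as the first argument to override it.
        String name = args.length > 0 ? args[0] : "com.zaxxer.sparsebits.SparseBitSet";
        ClassLoader loader = ClassAvailability.class.getClassLoader();
        System.out.println(name + " loadable: " + isLoadable(name, loader));
    }
}
```

In an OSGi container every bundle has its own class loader, so the check is only meaningful when the loader passed in belongs to a class inside the bundle being investigated.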
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094735#comment-17094735 ] Tim Allison commented on TIKA-3094: --- Thank you, [~bob]!
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094638#comment-17094638 ] Abhishek Chauhan commented on TIKA-3094: Glad! Thanks for sharing this, [~bob].
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094614#comment-17094614 ] Bob Paulin commented on TIKA-3094: -- Thanks, [~abchauha]. The build process adds OSGi-specific headers, so I'm not surprised the approach you describe above was unsuccessful. The maven-bundle-plugin will ensure it is embedded properly.
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094594#comment-17094594 ] Abhishek Chauhan commented on TIKA-3094: [~bob] Please find the .pptx file attached. I would also like to add that I opened the tika-bundle with 7-Zip and copied the SparseBitSet 1.2 jar into it, but that did not resolve the issue for me.
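A possible explanation for why copying the jar in by hand does not help: an OSGi framework only loads embedded jars that the bundle manifest lists in its Bundle-ClassPath header, so a jar that merely sits inside the archive stays invisible. The sketch below assumes tika-bundle exposes its embedded libraries through that header (not confirmed in this thread); the class name `BundleClassPath` and the jar file names in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Manifest;

/** Hypothetical helper: lists the entries of a bundle's Bundle-ClassPath header. */
final class BundleClassPath {

    /** Parses the header into its path entries, dropping any ";parameter" suffixes. */
    static List<String> entries(Manifest manifest) {
        List<String> result = new ArrayList<>();
        String header = manifest.getMainAttributes().getValue("Bundle-ClassPath");
        if (header == null) {
            result.add("."); // OSGi default: the bundle root only
            return result;
        }
        for (String clause : header.split(",")) {
            String path = clause.split(";")[0].trim();
            if (!path.isEmpty()) {
                result.add(path);
            }
        }
        return result;
    }

    /** True if the embedded jar is listed, i.e. visible to the bundle's class loader. */
    static boolean exposes(Manifest manifest, String embeddedJar) {
        return entries(manifest).contains(embeddedJar);
    }
}
```

For a manifest whose header reads `.,poi-4.1.2.jar`, calling `BundleClassPath.exposes(manifest, "SparseBitSet-1.2.jar")` returns false even if that jar has been copied into the archive.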
[jira] [Updated] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Chauhan updated TIKA-3094: --- Attachment: Sample PPT.pptx
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094517#comment-17094517 ] Bob Paulin commented on TIKA-3094: -- If SparseBitSet is embedded in the tika-bundle, then the library itself doesn't need the OSGi headers, as tika-bundle will provide them. It looks like it's Apache-licensed, so we should be able to do this. Can you provide a .pptx that is failing? It would be great if we could integrate it into the test suite.
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094466#comment-17094466 ] Tim Allison commented on TIKA-3094: --- [~bobpaulin], is this something we can fix within Tika or do we need to open an issue on com.zaxxer? I tried a couple of combinations of things in the tika-bundle pom.xml and couldn't get anything to work. The difference is that you know what you're doing. :D
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094407#comment-17094407 ] Abhishek Chauhan commented on TIKA-3094: [~tallison] We are calling it through the OSGi bundle. I also noticed that [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2] is not OSGi-ified (the manifest inside the jar is blank).
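The observation about the blank manifest can be checked programmatically. This is a rough sketch using only `java.util.jar` from the JDK; it treats a jar as an OSGi bundle when its manifest declares Bundle-SymbolicName, which is a simplification of what a framework actually validates. The class name `OsgiManifestCheck` is made up for illustration.

```java
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

/** Hypothetical helper: reports whether a jar's manifest carries OSGi bundle metadata. */
final class OsgiManifestCheck {

    /** Simplified rule: a manifest describes a bundle only if it names it via Bundle-SymbolicName. */
    static boolean isOsgiBundle(Manifest manifest) {
        if (manifest == null) {
            return false; // the jar has no META-INF/MANIFEST.MF at all
        }
        Attributes main = manifest.getMainAttributes();
        String symbolicName = main.getValue("Bundle-SymbolicName");
        return symbolicName != null && !symbolicName.trim().isEmpty();
    }

    /** Convenience wrapper that opens the jar at the given path. */
    static boolean isOsgiBundle(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return isOsgiBundle(jar.getManifest());
        }
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            System.out.println(path + " -> OSGi bundle: " + isOsgiBundle(path));
        }
    }
}
```

Run with one or more jar paths as arguments; a plain library jar with an empty manifest is reported as not being a bundle.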
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094385#comment-17094385 ] Tim Allison commented on TIKA-3094: --- How are you calling Tika? Are you using the osgi bundle or calling it directly?
Re: Issue with > 200% CPU after bulk usage
Hi

I ran into the issue again with Tika/Java using more CPU, up to 200+ %CPU. The scenario is that I have 3-4 long-running processes calling the Tika server (version 1.24), and occasionally 3-4 additional shorter processes (2-3 hours) start up and call the Tika server as well. The scenario runs for a couple of days, extracting text from various types of documents. The Tika server is running locally.

Top shows this:
--
top - 16:21:17 up 5 days,  8:12,  6 users,  load average: 2,64, 2,63, 2,61
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu(s): 50,8 us,  0,3 sy,  0,0 ni, 48,8 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem :  4032128 total,   129052 free,  2702236 used,  1200840 buff/cache
KiB Swap:  4192252 total,  2968864 free,  1223388 used.  1040340 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  911 root      20   0 4578604 1,229g   8024 S 204,3 32,0 859:11.22 java
  743 root      20   0  196596   5772    920 S   0,7  0,1  35:28.02 wizit_rest
34637 elastic+  20   0 21,346g 883808  30616 S   0,3 21,9   1250:04 java
    1 root      20   0  204620   3440   2376 S   0,0  0,1   0:14.99 systemd
    2 root      20   0       0      0      0 S   0,0  0,0   0:00.15 kthreadd
    3 root      20   0       0      0      0 S   0,0  0,0   1:46.20 ksoftirqd+
    5 root       0 -20       0      0      0 S   0,0  0,0   0:00.00 kworker/0+
    7 root      20   0       0      0      0 S   0,0  0,0   4:59.14 rcu_sched
    8 root      20   0       0      0      0 S   0,0  0,0   0:00.00 rcu_bh
    9 root      rt   0       0      0      0 S   0,0  0,0   0:03.83 migration+
--

At first I ran jstackseries.sh:
--
more jstack.911.202904.163848252
Attaching to process ID 911, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.242-b08
Deadlock Detection:

Can't print deadlocks:Unable to deduce type of thread from address 0x7f30bc02d800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)
--

It also froze the system: "systemd[1]: Freezing execution." But I finally got a thread dump via jstack, which I attach. I also attach the tika-config file in case it is useful. I hope this helps in analyzing the issue.

Kind regards
Hans

-----Original Message-----
From: Nick Burch
Sent: 16 April 2020 15:40
To: hans.mei...@avident-it.se
Cc: dev@tika.apache.org
Subject: Re: Issue with > 200% CPU after bulk usage

On Wed, 15 Apr 2020, hans.mei...@avident-it.se wrote:
> I have encountered an issue with Tika running locally on a box where
> the Java runtime goes up to over 200% CPU after running a bulk load
> of documents over a couple of days; it is more than 3 million documents.

Can you do a thread dump to show what the JVM is doing?
https://access.redhat.com/solutions/18178

Nick
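Once a thread dump is available, a common way to find the thread that is burning CPU is to take the per-thread ID reported by `top -H -p <pid>`, convert it to hexadecimal, and look for the matching `nid=0x...` field in the dump. The sketch below automates that lookup. It assumes the dump is in the usual HotSpot `jstack` text format, where each thread header line carries a `nid=` field; the class name `HotThreadFinder` and the sample values in the usage note are made up for illustration.

```java
import java.util.Optional;

/** Hypothetical helper: maps an OS thread ID from "top -H" to a thread in a jstack dump. */
final class HotThreadFinder {

    /** jstack prints the native thread ID in hex, e.g. 938 becomes "nid=0x3aa". */
    static String nidOf(long osThreadId) {
        return "nid=0x" + Long.toHexString(osThreadId);
    }

    /** Returns the header line of the thread with the given OS thread ID, if present. */
    static Optional<String> findThreadHeader(String dump, long osThreadId) {
        String nid = nidOf(osThreadId);
        for (String line : dump.split("\\R")) {
            // Compare whole tokens so that nid=0x3a does not match nid=0x3aa.
            for (String token : line.trim().split("\\s+")) {
                if (token.equals(nid)) {
                    return Optional.of(line.trim());
                }
            }
        }
        return Optional.empty();
    }
}
```

For example, if `top -H` showed thread 938 at the top, `HotThreadFinder.findThreadHeader(dumpText, 938)` would return the header line containing `nid=0x3aa`, and the stack frames below that line in the dump show what the thread was doing.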
[jira] [Updated] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Chauhan updated TIKA-3094:
---
Description:
This is a regression from Apache Tika 1.23. Text extraction for .pptx files, which worked with Apache Tika 1.23, no longer works in 1.24. For .ppt files it works fine in both 1.23 and 1.24.

According to the release notes [https://tika.apache.org/1.24/index.html], POI was updated to 4.1.2. That might be the root cause of this problem. POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2], which I guess is not present in the bundle.

was:
This is a regression from Apache Tika 1.23. For .ppt files it works fine in both 1.23 and 1.24. Text extraction for .pptx files, which worked with earlier Apache Tika, no longer works in 1.24.
[jira] [Created] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
Abhishek Chauhan created TIKA-3094:
--
Summary: Apache Tika fails to extract text for pptx extension.
Key: TIKA-3094
URL: https://issues.apache.org/jira/browse/TIKA-3094
Project: Tika
Issue Type: Bug
Affects Versions: 1.24
Reporter: Abhishek Chauhan

This is a regression from Apache Tika 1.23. For .ppt files it works fine in both 1.23 and 1.24. Text extraction for .pptx files, which worked with earlier Apache Tika, no longer works in 1.24.