[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682924#comment-13682924 ]

Ray Gauss II commented on TIKA-1130:

Test file and method committed in r1492909.

This was just added onto {{OOXMLParserTest}} and named with a {{disabled}} prefix rather than using {{@Ignore}}. I think we should start moving towards that for new test classes though.

> .docx text extract leaves out some portions of text
> ---
>
> Key: TIKA-1130
> URL: https://issues.apache.org/jira/browse/TIKA-1130
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.2, 1.3
> Environment: OpenJDK x86_64
> Reporter: Daniel Gibby
> Priority: Critical
> Attachments: Resume 6.4.13.docx
>
>
> When parsing a Microsoft Word .docx
> (application/vnd.openxmlformats-officedocument.wordprocessingml.document),
> certain portions of text remain unextracted.
> I have attached a .docx file that can be tested against. The 'gray' portions
> of text are what are not extracted, while the darker colored text extracts
> fine.
> Looking at the document.xml portion of the .docx zip file shows the text is
> all there.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682761#comment-13682761 ]

Nick Burch commented on TIKA-1130:

I think we've tended to prefix the method name, rather than commenting out, so it's more obvious that they want re-enabling later. Pop a note of the Tika bug number and the POI bug number in the javadoc for the method, so someone later can easily work out why it was disabled and when it might be ready.

That said, maybe this is our chance to move at least one test to JUnit 4, so we can use @Ignore?
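As a rough illustration of why the method-name prefix works: JUnit 3 style runners pick up public, no-argument, void methods whose names start with "test", so renaming a method with a "disabled" prefix hides it from the runner while leaving it compiled and easy to find. The sketch below imitates that discovery rule with plain reflection so it runs without JUnit on the classpath. The class and method names (`DisabledPrefixSketch`, `SampleParserTest`, `disabledTestGrayTextExtraction`) are hypothetical and are not the names committed to Tika.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DisabledPrefixSketch {

    /** Hypothetical stand-in for a JUnit 3 style test class. */
    public static class SampleParserTest {

        /** An enabled test: the name starts with "test". */
        public void testWordExtraction() {
            // assertions would go here
        }

        /**
         * Disabled until the upstream POI fix is released; see TIKA-1130.
         * The matching POI bug number would be noted here as well.
         */
        public void disabledTestGrayTextExtraction() {
            // assertions would go here once the fix lands
        }
    }

    /**
     * Returns the names of the methods that a JUnit 3 style runner would
     * treat as tests: public, void, no parameters, name starting with "test".
     */
    public static List<String> discoverTests(Class<?> testClass) {
        List<String> names = new ArrayList<>();
        for (Method m : testClass.getDeclaredMethods()) {
            boolean looksLikeTest = m.getName().startsWith("test")
                    && Modifier.isPublic(m.getModifiers())
                    && m.getParameterCount() == 0
                    && m.getReturnType() == void.class;
            if (looksLikeTest) {
                names.add(m.getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Only the method without the "disabled" prefix is reported.
        System.out.println(discoverTests(SampleParserTest.class));
    }
}
```

With JUnit 4 the same effect would come from annotating the method with @Ignore instead of renaming it, which also lets the runner report the test as skipped.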
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682644#comment-13682644 ]

Ray Gauss II commented on TIKA-1130:

I've created a unit test that reproduces the issue with a stripped down version of the original file.

Shall I comment out the actual test and commit?
[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682628#comment-13682628 ]

Tim Allison commented on TIKA-1132:

Tika gui took longer than I was willing to wait, too. tika.parseToString() returned a value in about 30 seconds. As you both suggested, the fraction formatter was likely the culprit. I just submitted a patch to poi 54686.

> Parsing some XLS documents hangs entire JVM, requires kill -9
> -
>
> Key: TIKA-1132
> URL: https://issues.apache.org/jira/browse/TIKA-1132
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2, 1.3
> Environment: Linux Suse:
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
> OSX 10.8.3:
> java version "1.7.0_06"
> Java(TM) SE Runtime Environment (build 1.7.0_06-b24)
> Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode)
> Reporter: Ryan Krueger
> Fix For: 1.1
>
> Attachments: mod3.xlsx, mod.xls
>
>
> Some XLS documents hang the entire JVM. A control-C or regular kill won't
> stop the JVM, a kill -9 is required.
> We're running within an email server application parsing documents to extract
> text of all attachments. When we hit a message with the affected attachment
> the entire JVM hangs and we mark the message to skip extracting the text from
> the affected message the next attempt. Unfortunately, it kills all email
> processing on the server until the internal watchdogs kill -9 the application.
> We have seen the issue for several months with different documents, but they
> are always Excel files. Some get complaints from Excel when opening but not
> all.
> In addition to experiencing the problem on our Linux servers I have tested on
> OSX and experienced the same problems. I ran the Tika UI and select the
> affected file or run the CLI. The problem is the same.
> Tested with java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls
> When running on multi-CPU machines there are two threads running at 100%
> every time.
> I have attached a document that triggers the error.
> I have tested with 1.2 and 1.3 with the same result. Running 1.1 the text is
> accurately extracted.
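For callers that embed extraction in a long-running service like the one described above, a possible mitigation (not something proposed in this thread) is to bound how long the caller waits on a single document. The sketch below is a minimal, standard-library-only illustration: `ExtractionTimeoutSketch` and `extractWithTimeout` are hypothetical names, and the `Callable<String>` stands in for whatever call does the parsing, for example a wrapper around `tika.parseToString()`. It only stops the caller from blocking. A CPU-bound parser loop that ignores interrupts keeps running on its daemon thread, and a hang that resists a regular kill, as reported here, would likely need the extraction moved into a separate process.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ExtractionTimeoutSketch {

    /**
     * Runs the given extraction on a daemon worker thread and waits at most
     * timeoutMillis for its result. Returns fallback if the extraction times
     * out, fails, or the calling thread is interrupted.
     */
    public static String extractWithTimeout(Callable<String> extraction,
                                            long timeoutMillis,
                                            String fallback) {
        // Daemon thread: a runaway extraction will not keep the JVM alive at exit.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extraction-worker");
            t.setDaemon(true);
            return t;
        });
        Future<String> future = executor.submit(extraction);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: this interrupts the worker, which a busy loop may ignore.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The extraction itself threw; treat it as "no text available".
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a slow document: sleeps far longer than the timeout.
        String text = extractWithTimeout(() -> {
            Thread.sleep(5_000);
            return "extracted text";
        }, 200, "");
        System.out.println(text.isEmpty() ? "timed out, skipping attachment" : text);
    }
}
```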
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682593#comment-13682593 ]

Nick Burch commented on TIKA-1130:

We'll hopefully get an updated version of Tim's patch in POI soon. Once there's then a POI release (expected shortly), Tika can upgrade, and fingers crossed the text will show up!

In the meantime, it would be good if someone could produce a JUnit unit test for Tika, showing the current issue. That'll let us ensure it gets fixed in Tika with the upgrade, and that it stays fixed into the future...