[ https://issues.apache.org/jira/browse/TIKA-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069170#comment-14069170 ]
Hudson commented on TIKA-1251:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.6 #104 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/104/])
Fix for TIKA-1251: RuntimeException with certain word docs (contributed by Vadim Roizman). (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1612373)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/test_TIKA-1251.doc

> RuntimeException when parsing word (.doc) documents. Works in Tika 1.4 but not 1.5
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-1251
>                 URL: https://issues.apache.org/jira/browse/TIKA-1251
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5
>         Environment: Linux 3.10.25-gentoo #3 SMP Tue Feb 25 07:35:57 CET 2014 x86_64 Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz GenuineIntel GNU/Linux
> Oracle JDK 1.7.0.51 [oracle-jdk-bin-1.7] and IcedTea JDK 6.1.12.7 [icedtea-bin-6]. Both fail
>            Reporter: Andreas
>            Assignee: Tyler Palsulich
>            Priority: Critical
>             Fix For: 1.6
>
>         Attachments: Ansvarsvakt rutine01.06.11.doc, TIKA-1251.patch
>
>
> Parsing the attached document works in Tika 1.4, but not in Tika 1.5. See output below. However, using Tika 1.4 is not a proper temporary solution as it leaves tons of special characters and functions in the output. See my post on SO: https://stackoverflow.com/questions/21929040
> {noformat}
> $ java -jar tika-app-1.4.jar Ansvarsvakt\ rutine01.06.11.doc > /dev/null
> $
> $ java -jar tika-app-1.5.jar Ansvarsvakt\ rutine01.06.11.doc > /dev/null
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@193936e1
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112)
> Caused by: java.lang.IllegalArgumentException: This paragraph is not the first one in the table
>     at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
>     at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
>     at org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
>     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 5 more
> {noformat}
> Sidenote: If I open the document in Abiword and just click ctrl+s to save the document (with no changes), Tika 1.5 parses it just fine.

--
This message was sent by Atlassian JIRA
(v6.2#6252)