Hi
I have a Word document that Tika 0.9 can't parse. The file command
reports it as:
maie.doc: CDF V2 Document, Little Endian, Os: Windows, Version 5.1, Code
page: 1252, Title: Modul: Unternehmungsf\177hrung 5, Author: APO,
Template: Normal.dot, Last Saved By: APO, Revision Number: 8, Name of
Creating Application: Microsoft Office Word, Last Printed: Sun Apr 26
23:38:00 2009, Create Time/Date: Sun Apr 26 23:38:00 2009, Last Saved
Time/Date: Wed Apr 29 08:45:00 2009, Number of Pages: 1, Number of
Words: 533, Number of Characters: 3364, Security: 0
Parsing fails with:
java -jar tika-app-0.9.jar ~/Download/maie.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@ec0a9f9
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
Caused by: java.lang.NullPointerException
    at org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor.uncompressPAP(ParagraphSprmUncompressor.java:47)
    at org.apache.poi.hwpf.model.PAPX.getParagraphProperties(PAPX.java:136)
    at org.apache.poi.hwpf.usermodel.Range.getParagraph(Range.java:828)
    at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:881)
    at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:127)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:81)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 5 more
I can send the document privately if someone is willing to dig into
this. My Java version is:
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.7) (suse-1.2.1-x86_64)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
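In case it helps with triage, here is a minimal sketch for confirming that a file really starts with the OLE2 compound-file signature (what the file command calls "CDF V2 Document"). The class name Ole2HeaderCheck and the method hasOle2Header are hypothetical helpers, not part of Tika or POI. The check only looks at the first 8 bytes, so it says nothing about why the paragraph properties fail to decode.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Tika or POI.
public class Ole2HeaderCheck {

    // The 8-byte signature at the start of every OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns true if the given bytes begin with the OLE2 signature.
    public static boolean hasOle2Header(byte[] head) {
        if (head == null || head.length < OLE2_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Ole2HeaderCheck <file>");
            return;
        }
        byte[] head = new byte[OLE2_MAGIC.length];
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            // Throws EOFException if the file is shorter than 8 bytes.
            in.readFully(head);
        } finally {
            in.close();
        }
        System.out.println(args[0] + ": OLE2 header "
                + (hasOle2Header(head) ? "present" : "missing"));
    }
}
```

It sticks to Java 6 APIs to match the runtime listed above. It can be run as "java Ole2HeaderCheck maie.doc" after compiling.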
Thanks
-Tom
--
Author of the book "Plone 3 Multimedia" - http://amzn.to/dtrp0C
Tom Gross
email.............@toms-projekte.de
skype.....................tom_gross
web.........http://toms-projekte.de
blog...http://blog.toms-projekte.de