I'm also interested in a solution. It probably needs a source code modification, so I will have to dig into the source to find a reasonable approach. The main problem is that the text extractor does not know the source file name, which it would need in order to write a useful entry to a possible text_extraction_error.log file :(
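To make the idea concrete, here is a rough sketch of what I have in mind. It is not Jackrabbit or Tika code: the class `ExtractionTracker`, its methods and the `Status` enum are hypothetical names, and it assumes the node path (or file name) can be handed over at the point where the parsing task is created, which is exactly the part that would need a source modification. The wrapper records the outcome per source name and prints the failure together with that name instead of only throwing it to the log.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper, not part of Jackrabbit or Tika. */
final class ExtractionTracker {

    enum Status { PENDING, INDEXED, FAILED }

    // Thread-safe maps, since extraction runs on a background thread.
    private final Map<String, Status> statuses = new ConcurrentHashMap<>();
    private final Map<String, Throwable> failures = new ConcurrentHashMap<>();

    /**
     * Wraps an extraction task so its outcome is recorded under sourceName.
     * The Callable stands in for the real parser call (an assumption).
     */
    Runnable track(String sourceName, Callable<String> extraction) {
        statuses.put(sourceName, Status.PENDING);
        return () -> {
            try {
                extraction.call();
                statuses.put(sourceName, Status.INDEXED);
            } catch (Exception e) {
                // Keep the cause and report it together with the source name.
                failures.put(sourceName, e);
                statuses.put(sourceName, Status.FAILED);
                System.err.println("Text extraction failed for " + sourceName + ": " + e);
            }
        };
    }

    /** Returns the recorded status, or null if the source was never tracked. */
    Status statusOf(String sourceName) {
        return statuses.get(sourceName);
    }

    /** Returns the exception recorded for a failed source, or null. */
    Throwable failureOf(String sourceName) {
        return failures.get(sourceName);
    }
}
```

The Runnable returned by `track` could be submitted to an executor in place of the bare parsing task, and `statusOf` would then answer the indexed / not yet / failed question for a given path. Whether the path is actually available at that point in the indexing code is something I still have to check.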
On Thu, Jul 22, 2010 at 1:13 PM, taha ben salah <[email protected]> wrote:
> Hi,
> I found that some documents failed to be indexed in Lucene.
> In particular, some Office 2003 documents failed to be parsed (office Tika
> parser).
> You can find the stack trace at the end of this submission.
> I wonder if there is a way to catch that exception (indexing is done in an
> asynchronous thread and the error is only written to the log).
> It would be even better if we could know (using some public API) the indexing
> status of documents (indexed / not yet / failed to index).
> Any suggestion is very welcome.
> Thanks in advance.
> Taha
>
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.officepar...@ced1ac
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:122)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>     at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
>     at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:195)
>     at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:160)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:636)
> Caused by: org.apache.poi.hpsf.HPSFRuntimeException: Value type of property ID 1 is not VT_I2 but 2048.
>     at org.apache.poi.hpsf.Section.<init>(Section.java:262)
>     at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:452)
>     at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:247)
>     at org.apache.tika.parser.microsoft.OfficeParser.parseSummaryEntryIfExists(OfficeParser.java:148)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:71)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>

--
OpenKM
http://www.openkm.com
http://www.guia-ubuntu.org
