https://issues.apache.org/bugzilla/show_bug.cgi?id=54823
Bug ID: 54823
Summary: Wrong type on Total Time field in
org.openxmlformats.schemas.officeDocument.x2006.extendedProperties.CTProperties
Product: POI
Version: 3.8
Hardware: PC
OS: Linux
Status: NEW
Severity: trivial
Priority: P2
Component: POI Overall
Assignee: [email protected]
Reporter: [email protected]
Classification: Unclassified
Hello, Apache POI devs,
I got this error while parsing a Microsoft Word document with the Apache Tika
parser:
org.apache.tika.exception.TikaException: Error creating OOXML extractor
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:125)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
at xxx.yyyy.services.impl.LuceneServiceImpl.fillDocumentFields(LuceneServiceImpl.java:167)
at xxx.yyyy.services.impl.LuceneServiceImpl.createLuceneDocumentForFile(LuceneServiceImpl.java:624)
at xxx.yyyy.services.impl.LuceneServiceImpl.indexNewFile(LuceneServiceImpl.java:650)
at $LuceneService_63044c23b5df.indexNewFile(Unknown Source)
at $LuceneService_63044c23b5e0.advised$indexNewFile_63044c23b5fa(Unknown Source)
at $LuceneService_63044c23b5e0$Invocation_indexNewFile_63044c23b5f9.proceedToAdvisedMethod(Unknown Source)
at org.apache.tapestry5.internal.plastic.AbstractMethodInvocation.proceed(AbstractMethodInvocation.java:84)
at xxx.yyyy.services.logging.LoggingAdvice.advise(LoggingAdvice.java:29)
at org.apache.tapestry5.internal.plastic.AbstractMethodInvocation.proceed(AbstractMethodInvocation.java:86)
at $LuceneService_63044c23b5e0.indexNewFile(Unknown Source)
at $LuceneService_63044c23b59b.indexNewFile(Unknown Source)
at xxx.yyyy.services.impl.IndexScheduleServiceImpl.executeDocumentActions(IndexScheduleServiceImpl.java:119)
at xxx.yyyy.services.impl.IndexScheduleServiceImpl.access$0(IndexScheduleServiceImpl.java:76)
at xxx.yyyy.services.impl.IndexScheduleServiceImpl$1.run(IndexScheduleServiceImpl.java:50)
at org.apache.tapestry5.ioc.internal.services.cron.PeriodicExecutorImpl$Job.invoke(PeriodicExecutorImpl.java:178)
at org.apache.tapestry5.ioc.internal.services.cron.PeriodicExecutorImpl$Job.invoke(PeriodicExecutorImpl.java:48)
at org.apache.tapestry5.ioc.internal.services.ParallelExecutorImpl$1.call(ParallelExecutorImpl.java:58)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.xmlbeans.impl.values.XmlValueOutOfRangeException: Invalid int value: 4294934530
at org.apache.xmlbeans.impl.values.JavaIntHolder.set_text(JavaIntHolder.java:43)
at org.apache.xmlbeans.impl.values.XmlObjectBase.update_from_wscanon_text(XmlObjectBase.java:1135)
at org.apache.xmlbeans.impl.values.XmlObjectBase.check_dated(XmlObjectBase.java:1274)
at org.apache.xmlbeans.impl.values.JavaIntHolder.intValue(JavaIntHolder.java:53)
at org.apache.xmlbeans.impl.values.XmlObjectBase.getIntValue(XmlObjectBase.java:1500)
at org.openxmlformats.schemas.officeDocument.x2006.extendedProperties.impl.CTPropertiesImpl.getTotalTime(Unknown Source)
at org.apache.tika.parser.microsoft.ooxml.MetadataExtractor.extractMetadata(MetadataExtractor.java:123)
at org.apache.tika.parser.microsoft.ooxml.MetadataExtractor.extract(MetadataExtractor.java:61)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:115)
... 27 more
I investigated the problem, and it seems to come from line 123 of the class
org.apache.tika.parser.microsoft.ooxml.MetadataExtractor:
addProperty(metadata, OfficeOpenXMLExtended.TOTAL_TIME,
propsHolder.getTotalTime());
At runtime the Total Time value needs a long, but getTotalTime() accepts only
an int.
This bug is not in Apache Tika itself but in the interface
org.openxmlformats.schemas.officeDocument.x2006.extendedProperties.CTProperties,
which is part of poi-ooxml-schemas version 3.8 and is used by Apache Tika.
CTProperties declares the return type of getTotalTime() as int, but at runtime
the value can exceed the int range, so the return type should be changed to
long.
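To illustrate why the value from the exception message cannot be held in an int, here is a minimal, self-contained sketch using only the JDK. The class name TotalTimeRange and the helper parseAsInt are made up for this example; they are not part of POI, XMLBeans, or Tika.

```java
public class TotalTimeRange {
    /** Parses the text as a 32-bit int; returns null when it does not fit. */
    static Integer parseAsInt(String text) {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            // Thrown for values above Integer.MAX_VALUE (2147483647).
            return null;
        }
    }

    public static void main(String[] args) {
        String raw = "4294934530"; // value taken from the exception message
        System.out.println("int max: " + Integer.MAX_VALUE);
        System.out.println("as int:  " + parseAsInt(raw));
        System.out.println("as long: " + Long.parseLong(raw));
    }
}
```

The same text parses without trouble as a long, which is why a long return type would avoid the exception for this document.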
My workaround was to copy the classes MetadataExtractor and
OOXMLExtractorFactory, override the class OOXMLParser (adding the method
getUnsupportedTypes), and remove the parsing of TOTAL_TIME, because I never
use this field.
This workaround can be applied when you use Apache Tika to parse .docx
documents.
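A lighter alternative (not what I actually did, only a sketch) would be to guard the single failing getter call and skip the field when it throws. The example below is self-contained and uses only the JDK: the IntSupplier is a hypothetical stand-in for propsHolder::getTotalTime, and the names SafeTotalTime and readOrSkip are invented for illustration. It assumes the XMLBeans exception is unchecked, which the stack trace suggests.

```java
import java.util.Optional;
import java.util.function.IntSupplier;

public class SafeTotalTime {
    /**
     * Calls the getter and returns its value, or empty if the getter throws.
     * The IntSupplier stands in for propsHolder::getTotalTime (hypothetical wiring).
     */
    static Optional<Integer> readOrSkip(IntSupplier getter) {
        try {
            return Optional.of(getter.getAsInt());
        } catch (RuntimeException e) {
            // In the trace above this would be XmlValueOutOfRangeException
            // (assumed to be an unchecked exception).
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Simulates a getter that fails the way getTotalTime() does in this report.
        IntSupplier broken = () -> {
            throw new IllegalArgumentException("Invalid int value: 4294934530");
        };
        IntSupplier fine = () -> 42;
        System.out.println(readOrSkip(broken).isPresent());
        System.out.println(readOrSkip(fine).isPresent());
    }
}
```

With a guard like this, the rest of the metadata could still be extracted and only TOTAL_TIME would be dropped for the affected documents.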
Best regards, Gjorgji
P.S. I hope my explanation was detailed enough.