[ https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617878#comment-17617878 ]
Ethan Wilansky edited comment on TIKA-3880 at 10/14/22 4:57 PM:
----------------------------------------------------------------

Hi Tim,

Good catch! The parser was not wrapped in a parsers element, and it does appear to be working now. However, regarding your question about default parsers, I didn't specify the default parser element in the Tika config. My understanding (possibly wrong) is that the default parser would be used for all other file types.

Considering the work we are doing, whenever we are dealing with a file type that has an associated Tika parser, we want to allow text extraction from large files, up to 50 MB in our case. Is there a way to set this globally? Would this be the way to do it?

{{<?xml version="1.0" encoding="UTF-8"?>}}
{{<properties>}}
{{  <parsers>}}
{{    <parser class="org.apache.tika.parser.DefaultParser">}}
{{      <params>}}
{{        <param name="byteArrayMaxOverride" type="int">{> default value}</param>}}
{{      </params>}}
{{    </parser>}}
{{  </parsers>}}
{{</properties>}}

In case you want to take a closer look, here's the call stack for processing the docx before I had byteArrayMaxOverride properly set:

INFO [qtp2027701910-29] 15:33:15,945 org.apache.tika.server.core.resource.DetectorResource Detecting media type for Filename: file.docx
INFO [qtp2027701910-27] 15:33:16,979 org.apache.tika.server.core.resource.TikaResource /tika (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
WARN [qtp2027701910-27] 15:33:16,995 org.apache.tika.server.core.resource.TikaResource tika: Text extraction failed (null)
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4b23d67b
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:175) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 686,679,089, but the maximum length for this record type is 100,000,000.
If the file is not corrupt and not large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type.
You can set a higher override value with IOUtils.setByteArrayMaxOverride()
    at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:599) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:276) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:230) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:203) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:82) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:319) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:123) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115) ~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.5.0.jar:2.5.0]
    ... 40 more

> Tika not picking-up setByteArrayMaxOverride from tika-config
> ------------------------------------------------------------
>
>                 Key: TIKA-3880
>                 URL: https://issues.apache.org/jira/browse/TIKA-3880
>             Project: Tika
>          Issue Type: Improvement
>          Components: app
>    Affects Versions: 2.5.0
>         Environment: We are running this through docker on a machine with
> plenty of memory resources allocated to Docker.
> Docker config: 32 GB, 8 processors
> Host machine: 64 GB, 32 processors
> Our docker-compose configuration is derived from:
> [https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]
> We are experienced with Docker and are confident that the issue isn't with
> Docker.
>
>            Reporter: Ethan Wilansky
>            Priority: Blocker
>
> I have specified this parser parameter in tika-config.xml:
> <properties>
>   <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
>     <params>
>       <param name="byteArrayMaxOverride" type="int">700000000</param>
>     </params>
>   </parser>
> </properties>
>
> I've also verified that the tika-config.xml is being picked-up by Tika on
> startup:
> org.apache.tika.server.core.TikaServerProcess Using custom config:
> /tika-config.xml
>
> However, when I encounter a very large docx file, I can clearly see that the
> configuration in tika-config is not being picked-up:
>
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an
> array of length 686,679,089, but the maximum length for this record type is
> 100,000,000.
> If the file is not corrupt and not large, please open an issue on bugzilla to
> request
> increasing the maximum allowable size for this record type.
> You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>
> I understand that this is a very large docx file. However, we can handle this
> amount of text extraction and are fine with the time it takes for Tika to
> perform this extraction and the amount of memory required to complete this
> extraction.
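
For reference, here is a minimal sketch of how the configuration from the issue description might look once the parser is wrapped in a parsers element, which is the fix mentioned in the comment. Excluding OOXMLParser from DefaultParser with a parser-exclude element is an assumption here: it follows Tika's documented pattern for customizing one parser while leaving the defaults in place for all other file types, and it has not been verified against 2.5.0 in this thread. The value 700000000 is the one from the issue description.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Assumption: keep the default parsers for every other file type,
         but exclude OOXMLParser so the customized entry below is used. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
    </parser>
    <!-- The parser entry from the issue description, now inside <parsers>. -->
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">700000000</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Whether the same parameter can be applied globally through DefaultParser is the open question in the comment; this sketch only covers the OOXML case.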
-- This message was sent by Atlassian Jira (v8.20.10#820010)