[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621191#comment-17621191 ]

Tim Allison commented on TIKA-3890:
-----------------------------------

d) writeLimit can be used with /rmeta (and /tika IIRC?). See "Specifying limits" here: https://cwiki.apache.org/confluence/display/TIKA/TikaServer

> Identifying an efficient approach for getting page count prior to running an extraction
> ----------------------------------------------------------------------------------------
>
>                 Key: TIKA-3890
>                 URL: https://issues.apache.org/jira/browse/TIKA-3890
>             Project: Tika
>          Issue Type: Improvement
>          Components: app
>    Affects Versions: 2.5.0
>         Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
>                      Docker container with 5.5GB reserved memory, 6GB limit
>                      Tika config w/ 2GB reserved memory, 5GB limit
>            Reporter: Ethan Wilansky
>            Priority: Blocker
>
> Tika is doing a great job with text extraction, until we encounter an Office document with an unreasonably large number of pages of extractable text, for example a Word document containing thousands of text pages.
> Unfortunately, we don't have an efficient way to determine page count before calling the /tika or /rmeta endpoints, and we either get back an array allocation error or have to set byteArrayMaxOverride to a large number to return the text or metadata containing the page count. Returning a result other than the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" http://localhost:9998/rmeta/ignore}}
> with the configuration:
> {quote}
> {{<?xml version="1.0" encoding="UTF-8" standalone="no"?>}}
> {{<properties>}}
> {{  <parsers>}}
> {{    <parser class="org.apache.tika.parser.DefaultParser">}}
> {{      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>}}
> {{      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>}}
> {{    </parser>}}
> {{    <parser class="org.apache.tika.parser.microsoft.OfficeParser">}}
> {{      <params>}}
> {{        <param name="byteArrayMaxOverride" type="int">175000000</param>}}
> {{      </params>}}
> {{    </parser>}}
> {{  </parsers>}}
> {{  <server>}}
> {{    <params>}}
> {{      <taskTimeoutMillis>120000</taskTimeoutMillis>}}
> {{      <forkedJvmArgs>}}
> {{        <arg>-Xms2000m</arg>}}
> {{        <arg>-Xmx5000m</arg>}}
> {{      </forkedJvmArgs>}}
> {{    </params>}}
> {{  </server>}}
> {{</properties>}}
> {quote}
> returns {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I don't configure {{byteArrayMaxOverride}}, I get this exception in just over a second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length for this record type is 100,000,000.}}
> The exception is the preferred result. With that in mind, can you answer these questions?
> 1. Will other extractable file types that don't use the OfficeParser also throw the same array allocation error for very large text extractions?
> 2. Is there any way to correlate the array length returned to the number of lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate lines or pages of extractable content in a file before sending it for extraction? It doesn't appear that /rmeta with the /ignore path param significantly improves efficiency over calling the /tika endpoint or /rmeta without /ignore.
> If it's useful, I can share the 8MB docx file containing 14k pages.
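To make the writeLimit suggestion concrete: per the "Specifying limits" section of the TikaServer wiki page linked in the comment above, limits can also be passed per request instead of being baked into the config. A minimal sketch, assuming the request header is named {{writeLimit}} and that a cap of 100,000 characters suits this workload (both worth verifying against the server version in use):

{{curl -T ./8mb.docx \}}
{{  -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" \}}
{{  -H "writeLimit: 100000" \}}
{{  http://localhost:9998/rmeta}}

The intent is to cap how much extracted text the handler writes for this one request. Whether that short-circuits the parse early enough to avoid the ~53-second case above depends on the parser, so it is worth testing against the 14k-page file.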
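On question 3, for OOXML inputs specifically there may be a cheaper pre-check than parsing: a .docx is a zip archive, and the producing application usually records a page count as a {{<Pages>}} element in {{docProps/app.xml}} (this is where the {{xmpTPg:NPages}} value above ultimately comes from). A minimal sketch, assuming the file is available locally and that the producer actually wrote the property; it can be absent or stale, so treat it as a heuristic rather than a guarantee, and it says nothing about non-OOXML formats:

{{unzip -p ./8mb.docx docProps/app.xml | grep -o '<Pages>[0-9]*</Pages>'}}

If the element is present, this avoids loading the document into POI at all; if it is missing, you would still need one of the approaches above.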