[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621155#comment-17621155 ]

Ethan Wilansky commented on TIKA-3890:
--------------------------------------

Thanks Nick and Tim. This is really helpful. Tim, about your questions:
a) Avoid sending large docs to Tika to save on network usage? I don't think 
this is what you're trying to solve, but obviously, don't send big files.

You're right, this isn't what we're trying to solve. We run Tika and our 
application in the same Kubernetes cluster, so network usage isn't a concern for us.

b) Hitting OOM on tika-server. I mentioned on another ticket how to tell 
tika-server to cache the file to local disk and that Tika is far more efficient 
with actual files for zip-based files and PDF. That won't solve everything. 
We've built tika-server to be robust against OOM. It'll restart. Or, use the 
pipes/async endpoints for robustness. In production on millions of files, 
you'll hit OOM, and that's OK.

Yes, thanks Tim. I configured the autoDetectParserConfig element as you 
referenced here: 
[https://cwiki.apache.org/confluence/display/TIKA/ModifyingContentWithHandlersAndMetadataFilters].
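
For reference, the fragment I added to my tika-config looks roughly like the 
sketch below. The spoolToDisk param is the setting I understood you to mean; 
the element and param names are my reading of the docs and the threshold value 
is just illustrative, so please correct me if the shape is wrong:

{code:xml}
<!-- Added alongside the existing <parsers> and <server> sections in tika-config.xml.
     If the incoming stream is larger than this many bytes, spool it to disk
     before parsing (value is a placeholder, not a recommendation). -->
<autoDetectParserConfig>
  <params>
    <spoolToDisk>100000</spoolToDisk>
  </params>
</autoDetectParserConfig>
{code}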
 

Outside of this, I'm assuming it's okay to send a file stream to Tika (like 
curl --data-binary <data/file>) instead of uploading the file (like curl -T 
<file>), and have tika-server spool the stream to disk based on the spoolToDisk 
setting. Is that right?
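
For clarity, these are the two invocations I'm comparing; the curl flags behave 
as commented, and the only Tika-specific assumption is that /rmeta accepts the 
streamed PUT body the same way it accepts the uploaded file:

{code:bash}
# Upload the file with PUT (what the examples in this ticket use):
curl -T ./8mb.docx \
  -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" \
  http://localhost:9998/rmeta/ignore

# Stream the body instead. --data-binary defaults to POST, so force PUT;
# the @ prefix tells curl to read the request body from the file.
curl -X PUT --data-binary @./8mb.docx \
  -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" \
  http://localhost:9998/rmeta/ignore
{code}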

c) Getting a huge amount of text back and wasting network resources. Maybe 
configure gzip compression on results?

Yes, I'm testing that now to see if it helps. However, it's more important for 
us to avoid handling large text extractions in the first place.
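
On the client side this is just curl's --compressed flag, which adds an 
Accept-Encoding: gzip header and transparently decompresses the body if the 
server actually compresses the response (it has no effect on the upload direction):

{code:bash}
# Ask for a gzip-compressed response; falls back to an uncompressed body
# if the server doesn't honor Accept-Encoding.
curl --compressed -T ./8mb.docx http://localhost:9998/rmeta/ignore
{code}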

d) Getting a huge amount of text back when all you really want is like the 
first 1000000 characters. Set a writeLimit and Tika will stop processing after 
it has extracted that many characters.

I didn't know about this option and it fits our needs perfectly. It looks like 
writeLimit is a configuration setting for the /pipes or /async endpoints. Is 
that correct? I'll work with 
[https://cwiki.apache.org/confluence/display/TIKA/tika-pipes] to see if I can 
get this working.
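
For completeness, my understanding is that the library-level counterpart is the 
writeLimit constructor argument on BodyContentHandler; here is a rough sketch of 
how I'd exercise it locally (file name and limit are placeholders, and the 
exception handling reflects my reading that Tika signals truncation with a 
SAXException):

{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class WriteLimitCheck {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // Collect at most ~1,000,000 characters of body text.
        BodyContentHandler handler = new BodyContentHandler(1_000_000);
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get("8mb.docx"))) {
            parser.parse(stream, handler, metadata);
        } catch (org.xml.sax.SAXException e) {
            // Expected for oversized documents once the write limit is reached.
            System.err.println("Stopped early: " + e.getMessage());
        }
        System.out.println("Extracted characters: " + handler.toString().length());
    }
}
{code}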

e) Something else?

No, you and Nick are on target. Thanks again for the fantastic support.

> Identifying an efficient approach for getting page count prior to running an 
> extraction
> ---------------------------------------------------------------------------------------
>
>                 Key: TIKA-3890
>                 URL: https://issues.apache.org/jira/browse/TIKA-3890
>             Project: Tika
>          Issue Type: Improvement
>          Components: app
>    Affects Versions: 2.5.0
>         Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika config w/ 2GB reserved memory, 5GB limit 
>            Reporter: Ethan Wilansky
>            Priority: Blocker
>
> Tika is doing a great job with text extraction, until we encounter an Office 
> document with an unreasonably large number of pages with extractable text, 
> for example a Word document containing thousands of text pages. 
> Unfortunately, we don't have an efficient way to determine page count before 
> calling the /tika or /rmeta endpoints and either getting back an array 
> allocation error or setting byteArrayMaxOverride to a large number to return 
> the text or metadata containing the page count. Returning a result other than 
> the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type: 
> application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
> [http://localhost:9998/rmeta/ignore]}}
> {quote}{{with the configuration:}}
> {{<?xml version="1.0" encoding="UTF-8" standalone="no"?>}}
> {{<properties>}}
> {{  <parsers>}}
> {{    <parser class="org.apache.tika.parser.DefaultParser">}}
> {{      <parser-exclude 
> class="org.apache.tika.parser.ocr.TesseractOCRParser"/>}}
> {{      <parser-exclude 
> class="org.apache.tika.parser.microsoft.OfficeParser"/>}}
> {{    </parser>}}
> {{    <parser class="org.apache.tika.parser.microsoft.OfficeParser">}}
> {{      <params>}}
> {{        <param name="byteArrayMaxOverride" type="int">175000000</param>}}
> {{      </params>}}
> {{    </parser>}}
> {{  </parsers>}}
> {{  <server>}}
> {{    <params>}}
> {{      <taskTimeoutMillis>120000</taskTimeoutMillis>}}
> {{      <forkedJvmArgs>}}
> {{        <arg>-Xms2000m</arg>}}
> {{        <arg>-Xmx5000m</arg>}}
> {{      </forkedJvmArgs>}}
> {{    </params>}}
> {{  </server>}}
> {{</properties>}}
> {quote}
> returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I 
> don't configure {{byteArrayMaxOverride}} I get this exception in just over a 
> second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length 
> for this record type is 100,000,000.}}
> The exception is the preferred result. With that in mind, can you answer 
> these questions?
> 1. Will other extractable file types that don't use the OfficeParser also 
> throw the same array allocation error for very large text extractions? 
> 2. Is there any way to correlate the array length returned to the number of 
> lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate lines or pages of extractable 
> content in a file before sending it for extraction? It doesn't appear that 
> /rmeta with the /ignore path param significantly improves efficiency over 
> calling the /tika endpoint or /rmeta without /ignore.
> If it's useful, I can share the 8MB docx file containing 14k pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
