[jira] [Closed] (TIKA-3894) Documentation update needed

2022-10-21 Thread Ethan Wilansky (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wilansky closed TIKA-3894.


Thanks Tim!

> Documentation update needed
> ---
>
> Key: TIKA-3894
> URL: https://issues.apache.org/jira/browse/TIKA-3894
> Project: Tika
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.5.0
>Reporter: Ethan Wilansky
>Priority: Minor
> Fix For: 2.5.1
>
>
> In this documentation: 
> [https://cwiki.apache.org/confluence/display/TIKA/TikaServer], sections: 
> Filtering Metadata Keys and Filtering Metadata Objects, I believe the element 
> names used in the configuration examples need to be changed to the 
> <include> and <field> elements shown below. Here's an example of something I 
> tested for filtering metadata keys:
> {code:xml}
>   ...
>   <metadataFilters>
>     <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
>       <params>
>         <include>
>           <field>extended-properties:Application</field>
>           <field>xmpTPg:NPages</field>
>           <field>meta:page-count</field>
>           <field>meta:line-count</field>
>           <field>X-TIKA:content</field>
>         </include>
>       </params>
>     </metadataFilter>
>   </metadataFilters>
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (TIKA-3880) Tika not picking-up setByteArrayMaxOverride from tika-config

2022-10-21 Thread Ethan Wilansky (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wilansky closed TIKA-3880.


> Tika not picking-up setByteArrayMaxOverride from tika-config
> 
>
> Key: TIKA-3880
> URL: https://issues.apache.org/jira/browse/TIKA-3880
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: We are running this through docker on a machine with 
> plenty of memory resources allocated to Docker.
> Docker config: 32 GB, 8 processors
> Host machine: 64 GB, 32 processors
> Our docker-compose configuration is derived from: 
> [https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]
> We are experienced with Docker and are confident that the issue isn't with 
> Docker.
>  
>Reporter: Ethan Wilansky
>Priority: Blocker
> Fix For: 2.5.0
>
>
> I have specified this parser parameter in tika-config.xml:
> <properties>
>   <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>     <params>
>       <param name="byteArrayMaxOverride" type="int">7</param>
>     </params>
>   </parser>
> </properties>
>  
> I've also verified that the tika-config.xml is being picked-up by Tika on 
> startup:
>   org.apache.tika.server.core.TikaServerProcess Using custom config: 
> /tika-config.xml
>  
> However, when I encounter a very large docx file, I can clearly see that the 
> configuration in tika-config is not being picked-up:
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
> array of length 686,679,089, but the maximum length for this record type is 
> 100,000,000.
> If the file is not corrupt and not large, please open an issue on bugzilla to 
> request 
> increasing the maximum allowable size for this record type.
> You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>  
> I understand that this is a very large docx file. However, we can handle this 
> amount of text extraction and are fine with the time it takes for Tika to 
> perform this extraction and the amount of memory required to complete this 
> extraction. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-3894) Documentation update needed

2022-10-20 Thread Ethan Wilansky (Jira)
Ethan Wilansky created TIKA-3894:


 Summary: Documentation update needed
 Key: TIKA-3894
 URL: https://issues.apache.org/jira/browse/TIKA-3894
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.5.0
Reporter: Ethan Wilansky


In this documentation: 
[https://cwiki.apache.org/confluence/display/TIKA/TikaServer], sections: 
Filtering Metadata Keys and Filtering Metadata Objects, I believe the element 
names used in the configuration examples need to be changed to the <include> 
and <field> elements shown below. Here's an example of something I tested for 
filtering metadata keys:

{code:xml}
  ...
  <metadataFilters>
    <metadataFilter class="org.apache.tika.metadata.filter.IncludeFieldMetadataFilter">
      <params>
        <include>
          <field>extended-properties:Application</field>
          <field>xmpTPg:NPages</field>
          <field>meta:page-count</field>
          <field>meta:line-count</field>
          <field>X-TIKA:content</field>
        </include>
      </params>
    </metadataFilter>
  </metadataFilters>
...
{code}
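
A quick way to verify the filter from the command line (a sketch; it assumes 
tika-server was started with the config above on the default port 9998, and 
{{example.docx}} stands in for any local Office file):
{code:bash}
# PUT the document to /rmeta and confirm that only the five included
# fields come back in the returned JSON metadata array.
curl -T ./example.docx http://localhost:9998/rmeta
{code}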



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-20 Thread Ethan Wilansky (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wilansky closed TIKA-3890.

Fix Version/s: 2.5.0
   Resolution: Fixed

> Identifying an efficient approach for getting page count prior to running an 
> extraction
> ---
>
> Key: TIKA-3890
> URL: https://issues.apache.org/jira/browse/TIKA-3890
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika config w/ 2GB reserved memory, 5GB limit 
>Reporter: Ethan Wilansky
>Priority: Blocker
> Fix For: 2.5.0
>
>
> Tika is doing a great job with text extraction, until we encounter an Office 
> document with an unreasonably large number of pages with extractable text, 
> for example a Word document containing thousands of text pages. 
> Unfortunately, we don't have an efficient way to determine page count before 
> calling the /tika or /rmeta endpoints and either getting back an array 
> allocation error or setting byteArrayMaxOverride to a large number to return 
> the text or metadata containing the page count. Returning a result other than 
> the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type: 
> application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
> [http://localhost:9998/rmeta/ignore]}}
> {quote}with the configuration:
> {code:xml}
> <?xml version="1.0" encoding="UTF-8" ?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>       <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
>     </parser>
>     <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>       <params>
>         <param name="byteArrayMaxOverride" type="int">17500</param>
>       </params>
>     </parser>
>   </parsers>
>   <server>
>     <params>
>       <!-- one more server param survives only as the value 12 in the archive -->
>       <forkedJvmArgs>
>         <arg>-Xms2000m</arg>
>         <arg>-Xmx5000m</arg>
>       </forkedJvmArgs>
>     </params>
>   </server>
> </properties>
> {code}
> {quote}
> returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I 
> don't configure {{byteArrayMaxOverride}} I get this exception in just over a 
> second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length 
> for this record type is 100,000,000.}}
> The exception is the preferred result. With that in mind, can you answer 
> these questions?
> 1. Will other extractable file types that don't use the OfficeParser also 
> throw the same array allocation error for very large text extractions? 
> 2. Is there any way to correlate the array length returned to the number of 
> lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate lines or pages of extractable 
> content in a file before sending it for extraction? It doesn't appear that 
> /rmeta with the /ignore path param significantly improves efficiency over 
> calling the /tika endpoint or /rmeta without /ignore.
> If it's useful, I can share the 8MB docx file containing 14k pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-20 Thread Ethan Wilansky (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621322#comment-17621322
 ] 

Ethan Wilansky commented on TIKA-3890:
--

Great information, thanks. I'll close this issue.

> Identifying an efficient approach for getting page count prior to running an 
> extraction
> ---
>
> Key: TIKA-3890
> URL: https://issues.apache.org/jira/browse/TIKA-3890
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika config w/ 2GB reserved memory, 5GB limit 
>Reporter: Ethan Wilansky
>Priority: Blocker
>
> Tika is doing a great job with text extraction, until we encounter an Office 
> document with an unreasonably large number of pages with extractable text, 
> for example a Word document containing thousands of text pages. 
> Unfortunately, we don't have an efficient way to determine page count before 
> calling the /tika or /rmeta endpoints and either getting back an array 
> allocation error or setting byteArrayMaxOverride to a large number to return 
> the text or metadata containing the page count. Returning a result other than 
> the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type: 
> application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
> [http://localhost:9998/rmeta/ignore]}}
> {quote}with the configuration:
> {code:xml}
> <?xml version="1.0" encoding="UTF-8" ?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>       <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
>     </parser>
>     <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>       <params>
>         <param name="byteArrayMaxOverride" type="int">17500</param>
>       </params>
>     </parser>
>   </parsers>
>   <server>
>     <params>
>       <!-- one more server param survives only as the value 12 in the archive -->
>       <forkedJvmArgs>
>         <arg>-Xms2000m</arg>
>         <arg>-Xmx5000m</arg>
>       </forkedJvmArgs>
>     </params>
>   </server>
> </properties>
> {code}
> {quote}
> returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I 
> don't configure {{byteArrayMaxOverride}} I get this exception in just over a 
> second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length 
> for this record type is 100,000,000.}}
> The exception is the preferred result. With that in mind, can you answer 
> these questions?
> 1. Will other extractable file types that don't use the OfficeParser also 
> throw the same array allocation error for very large text extractions? 
> 2. Is there any way to correlate the array length returned to the number of 
> lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate lines or pages of extractable 
> content in a file before sending it for extraction? It doesn't appear that 
> /rmeta with the /ignore path param significantly improves efficiency over 
> calling the /tika endpoint or /rmeta without /ignore.
> If it's useful, I can share the 8MB docx file containing 14k pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-20 Thread Ethan Wilansky (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621155#comment-17621155
 ] 

Ethan Wilansky commented on TIKA-3890:
--

Thanks Nick and Tim. This is really helpful. Tim, about your questions:
a) Avoid sending large docs to Tika to save on network usage? I don't think 
this is what you're trying to solve, but obv, don't send big files.

You're right, this isn't a concern for us. We run Tika and our application in a 
k8s cluster, so network usage isn't an issue.

b) Hitting OOM on tika-server. I mentioned on another ticket how to tell 
tika-server to cache the file to local disk and that Tika is far more efficient 
with actual files for zip-based files and PDF. That won't solve everything. 
We've built tika-server to be robust against OOM. It'll restart. Or, use the 
pipes/async endpoints for robustness. In production on millions of files, 
you'll hit oom, and that's ok.

Yes, thanks Tim. I configured the autoDetectParserConfig element as you 
referenced here: 
[https://cwiki.apache.org/confluence/display/TIKA/ModifyingContentWithHandlersAndMetadataFilters].

Outside of this, I'm assuming it's okay to send a file stream to tika (like 
{{curl --data-binary @<file>}}) instead of uploading the file (like 
{{curl -T <file>}}) and have it spool the stream to disk based on the 
spoolToDisk setting. Is that right?
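
For concreteness, the two invocation styles in question, as a sketch with 
{{8mb.docx}} standing in for any local file:
{code:bash}
# Upload style: curl reads the file itself and issues a PUT.
curl -T ./8mb.docx http://localhost:9998/tika

# Stream style: the body is piped as raw bytes; curl defaults to POST,
# so the PUT method and content type are set explicitly.
curl -X PUT --data-binary @./8mb.docx \
  -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" \
  http://localhost:9998/tika
{code}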

c) Getting a huge amount of text back and wasting network resources. Maybe 
configure gzip compression on results?

Yes, I'm testing that now to see if that helps. However, it's more important 
for us that we don't attempt handling large text extractions in the first place.
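
The client side of that gzip test is plain content negotiation; a sketch that 
assumes response compression is actually enabled on the server or a fronting 
proxy:
{code:bash}
# Ask for a gzipped response and let curl decompress it transparently.
curl -T ./8mb.docx -H "Accept-Encoding: gzip" --compressed http://localhost:9998/tika
{code}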

d) Getting a huge amount of text back when all you really want is like the 
first 100 characters. Set a writeLimit and Tika will stop processing after 
it has extracted that many characters.

I didn't know about this option and it fits our needs perfectly. It looks like 
writeLimit is a configuration setting for the /pipes or /async endpoints. Is 
that correct? I'll work with 
[https://cwiki.apache.org/confluence/display/TIKA/tika-pipes] to see if I can 
get this working.
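
A minimal sketch of such an /async request, based on the tika-pipes wiki; the 
fetcher/emitter names and the handlerConfig field names here are assumptions, 
not tested:
{code:bash}
# One fetch-emit tuple; writeLimit caps the number of extracted characters.
curl -X POST -H "Content-Type: application/json" http://localhost:9998/async -d '
[{"id":"1",
  "fetcherName":"fsf",
  "fetchKey":"8mb.docx",
  "emitterName":"fse",
  "handlerConfig":{"type":"TEXT","writeLimit":100000}}]'
{code}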

e) Something else?

No, you and Nick are on target. Thanks again for the fantastic support.

> Identifying an efficient approach for getting page count prior to running an 
> extraction
> ---
>
> Key: TIKA-3890
> URL: https://issues.apache.org/jira/browse/TIKA-3890
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika config w/ 2GB reserved memory, 5GB limit 
>Reporter: Ethan Wilansky
>Priority: Blocker
>
> Tika is doing a great job with text extraction, until we encounter an Office 
> document with an unreasonably large number of pages with extractable text, 
> for example a Word document containing thousands of text pages. 
> Unfortunately, we don't have an efficient way to determine page count before 
> calling the /tika or /rmeta endpoints and either getting back an array 
> allocation error or setting byteArrayMaxOverride to a large number to return 
> the text or metadata containing the page count. Returning a result other than 
> the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type: 
> application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
> [http://localhost:9998/rmeta/ignore]}}
> {quote}with the configuration:
> {code:xml}
> <?xml version="1.0" encoding="UTF-8" ?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>       <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
>     </parser>
>     <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>       <params>
>         <param name="byteArrayMaxOverride" type="int">17500</param>
>       </params>
>     </parser>
>   </parsers>
>   <server>
>     <params>
>       <!-- one more server param survives only as the value 12 in the archive -->
>       <forkedJvmArgs>
>         <arg>-Xms2000m</arg>
>         <arg>-Xmx5000m</arg>
>       </forkedJvmArgs>
>     </params>
>   </server>
> </properties>
> {code}
> {quote}
> returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I 
> don't configure {{byteArrayMaxOverride}} I get this exception in just over a 
> second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length 
> for this record type is 100,000,000.}}
> The exception is the preferred result. With that in mind, can you answer 
> these questions?
> 1. Will other extractable file types that don't use the OfficeParser also 
> throw the same array allocation error for very large text extractions? 
> 2. Is there any way to correlate the array length returned to the number of 
> lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate lines or pages of extractable 
> 

[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Ethan Wilansky (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620630#comment-17620630
 ] 

Ethan Wilansky commented on TIKA-3890:
--

Aha, I'll have to give Apache POI a try. Thanks Nick. It would be useful to get 
an extracted file size estimate. For example, the 8MB docx file generated a 
31MB text file. Is there a way in Tika to estimate extraction size beforehand?
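
For docx specifically, the page count lives in the OOXML extended-properties 
part, so a cheap pre-check is possible before sending the file anywhere (a 
sketch; it assumes the producing application actually wrote the optional 
{{docProps/app.xml}} part):
{code:bash}
# A .docx is a zip archive; docProps/app.xml carries <Pages>, <Words>,
# and <Lines> counts written by the authoring application.
unzip -p ./8mb.docx docProps/app.xml | grep -o '<Pages>[0-9]*</Pages>'
{code}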

> Identifying an efficient approach for getting page count prior to running an 
> extraction
> ---
>
> Key: TIKA-3890
> URL: https://issues.apache.org/jira/browse/TIKA-3890
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika config w/ 2GB reserved memory, 5GB limit 
>Reporter: Ethan Wilansky
>Priority: Blocker
>
> Tika is doing a great job with text extraction, until we encounter an Office 
> document with an unreasonably large number of pages with extractable text, 
> for example a Word document containing thousands of text pages. 
> Unfortunately, we don't have an efficient way to determine page count before 
> calling the /tika or /rmeta endpoints and either getting back an array 
> allocation error or setting byteArrayMaxOverride to a large number to return 
> the text or metadata containing the page count. Returning a result other than 
> the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type: 
> application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
> [http://localhost:9998/rmeta/ignore]}}
> {quote}with the configuration:
> {code:xml}
> <?xml version="1.0" encoding="UTF-8" ?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>       <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
>     </parser>
>     <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>       <params>
>         <param name="byteArrayMaxOverride" type="int">17500</param>
>       </params>
>     </parser>
>   </parsers>
>   <server>
>     <params>
>       <!-- one more server param survives only as the value 12 in the archive -->
>       <forkedJvmArgs>
>         <arg>-Xms2000m</arg>
>         <arg>-Xmx5000m</arg>
>       </forkedJvmArgs>
>     </params>
>   </server>
> </properties>
> {code}
> {quote}
> returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I 
> don't configure {{byteArrayMaxOverride}} I get this exception in just over a 
> second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length 
> for this record type is 100,000,000.}}
> The exception is the preferred result. With that in mind, can you answer 
> these questions?
> 1. Will other extractable file types that don't use the OfficeParser also 
> throw the same array allocation error for very large text extractions? 
> 2. Is there any way to correlate the array length returned to the number of 
> lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate lines or pages of extractable 
> content in a file before sending it for extraction? It doesn't appear that 
> /rmeta with the /ignore path param significantly improves efficiency over 
> calling the /tika endpoint or /rmeta without /ignore.
> If it's useful, I can share the 8MB docx file containing 14k pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Ethan Wilansky (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wilansky updated TIKA-3890:
-
Description: 
Tika is doing a great job with text extraction, until we encounter an Office 
document with an unreasonably large number of pages with extractable text, for 
example a Word document containing thousands of text pages. Unfortunately, we 
don't have an efficient way to determine page count before calling the /tika or 
/rmeta endpoints and either getting back an array allocation error or setting 
byteArrayMaxOverride to a large number to return the text or metadata 
containing the page count. Returning a result other than the array allocation 
error can take significant time.

For example, this call:
{{curl -T ./8mb.docx -H "Content-Type: 
application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
[http://localhost:9998/rmeta/ignore]}}
{quote}with the configuration:
{code:xml}
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
    </parser>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">17500</param>
      </params>
    </parser>
  </parsers>
  <server>
    <params>
      <!-- one more server param survives only as the value 12 in the archive -->
      <forkedJvmArgs>
        <arg>-Xms2000m</arg>
        <arg>-Xmx5000m</arg>
      </forkedJvmArgs>
    </params>
  </server>
</properties>
{code}
{quote}
returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.

Yes, I know this is a huge docx file and I don't want to process it. If I don't 
configure {{byteArrayMaxOverride}} I get this exception in just over a second:

{{Tried to allocate an array of length 172,983,026, but the maximum length for 
this record type is 100,000,000.}}

The exception is the preferred result. With that in mind, can you answer these 
questions?
1. Will other extractable file types that don't use the OfficeParser also throw 
the same array allocation error for very large text extractions? 
2. Is there any way to correlate the array length returned to the number of 
lines or pages in the associated file to parse?
3. Is there an efficient way to calculate lines or pages of extractable content 
in a file before sending it for extraction? It doesn't appear that /rmeta with 
the /ignore path param significantly improves efficiency over calling the /tika 
endpoint or /rmeta without /ignore.

If it's useful, I can share the 8MB docx file containing 14k pages.

  was:
Tika is doing a great job with text extraction, until we encounter an Office 
document with an unreasonably large number of pages with extractable text, for 
example a Word document containing thousands of text pages. Unfortunately, we 
don't have an efficient way to determine page count before calling the /tika or 
/rmeta endpoints and either getting back a record size error or setting 
byteArrayMaxOverride to a large number to either return the text or metadata 
containing the page count. In both cases, this can take significant time to 
return a result.

For example, this call:
{{curl -T ./8mb.docx -H "Content-Type: 
application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
[http://localhost:9998/rmeta/ignore]}}
{quote}with the configuration:
{code:xml}
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
    </parser>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">17500</param>
      </params>
    </parser>
  </parsers>
  <server>
    <params>
      <!-- one more server param survives only as the value 12 in the archive -->
      <forkedJvmArgs>
        <arg>-Xms2000m</arg>
        <arg>-Xmx5000m</arg>
      </forkedJvmArgs>
    </params>
  </server>
</properties>
{code}
{quote}
returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.

Yes, I know this is a huge docx file and I don't want to process it. If I don't 
configure {{byteArrayMaxOverride}} I get this exception in just over a second:

{{Tried to allocate an array of length 172,983,026, but the maximum length for 
this record type is 100,000,000.}}

The exception is the preferred result. With that in mind, can you answer these 
questions?
1. Will other extractable file types that don't use the OfficeParser also throw 
the same array allocation error for very large text extractions? 
2. Is there any way to correlate the array length returned to the number of 
lines or pages in the associated file to parse?
3. Is there an efficient way to calculate lines or pages of extractable content 
in a file before sending it for extraction? It doesn't appear that /rmeta with 
the /ignore path param significantly improves efficiency over calling the /tika 
endpoint or /rmeta without /ignore.

If it's useful, I can share the 8MB docx file containing 14k pages.


> Identifying an efficient approach for getting page count prior to running an 
> extraction
> ---
>
> Key: TIKA-3890
> URL: https://issues.apache.org/jira/browse/TIKA-3890
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika 

[jira] [Created] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Ethan Wilansky (Jira)
Ethan Wilansky created TIKA-3890:


 Summary: Identifying an efficient approach for getting page count 
prior to running an extraction
 Key: TIKA-3890
 URL: https://issues.apache.org/jira/browse/TIKA-3890
 Project: Tika
  Issue Type: Improvement
  Components: app
Affects Versions: 2.5.0
 Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
Docker container with 5.5GB reserved memory, 6GB limit
Tika config w/ 2GB reserved memory, 5GB limit 
Reporter: Ethan Wilansky


Tika is doing a great job with text extraction, until we encounter an Office 
document with an unreasonably large number of pages with extractable text, for 
example a Word document containing thousands of text pages. Unfortunately, we 
don't have an efficient way to determine page count before calling the /tika or 
/rmeta endpoints and either getting back a record size error or setting 
byteArrayMaxOverride to a large number to either return the text or metadata 
containing the page count. In both cases, this can take significant time to 
return a result.

For example, this call:
{{curl -T ./8mb.docx -H "Content-Type: 
application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
[http://localhost:9998/rmeta/ignore]}}
{quote}with the configuration:
{code:xml}
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
    </parser>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">17500</param>
      </params>
    </parser>
  </parsers>
  <server>
    <params>
      <!-- one more server param survives only as the value 12 in the archive -->
      <forkedJvmArgs>
        <arg>-Xms2000m</arg>
        <arg>-Xmx5000m</arg>
      </forkedJvmArgs>
    </params>
  </server>
</properties>
{code}
{quote}
returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.

Yes, I know this is a huge docx file and I don't want to process it. If I don't 
configure {{byteArrayMaxOverride}} I get this exception in just over a second:

{{Tried to allocate an array of length 172,983,026, but the maximum length for 
this record type is 100,000,000.}}

The exception is the preferred result. With that in mind, can you answer these 
questions?
1. Will other extractable file types that don't use the OfficeParser also throw 
the same array allocation error for very large text extractions? 
2. Is there any way to correlate the array length returned to the number of 
lines or pages in the associated file to parse?
3. Is there an efficient way to calculate lines or pages of extractable content 
in a file before sending it for extraction? It doesn't appear that /rmeta with 
the /ignore path param significantly improves efficiency over calling the /tika 
endpoint or /rmeta without /ignore.

If it's useful, I can share the 8MB docx file containing 14k pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-3880) Tika not picking-up setByteArrayMaxOverride from tika-config

2022-10-17 Thread Ethan Wilansky (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wilansky resolved TIKA-3880.
--
Fix Version/s: 2.5.0
   Resolution: Resolved

Confirmed that the setByteArrayMaxOverride setting is being read and applied to 
the targeted parser.
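
For reference, a sketch of the configuration shape that ended up working, 
reconstructed from this thread (the parser element wrapped in a parsers 
element; the override value here is illustrative):
{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- keep the default parsers for every other file type -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
    </parser>
    <!-- raise POI's maximum record allocation for the Office parser -->
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">700000000</param>
      </params>
    </parser>
  </parsers>
</properties>
{code}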

> Tika not picking-up setByteArrayMaxOverride from tika-config
> 
>
> Key: TIKA-3880
> URL: https://issues.apache.org/jira/browse/TIKA-3880
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: We are running this through docker on a machine with 
> plenty of memory resources allocated to Docker.
> Docker config: 32 GB, 8 processors
> Host machine: 64 GB, 32 processors
> Our docker-compose configuration is derived from: 
> [https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]
> We are experienced with Docker and are confident that the issue isn't with 
> Docker.
>  
>Reporter: Ethan Wilansky
>Priority: Blocker
> Fix For: 2.5.0
>
>
> I have specified this parser parameter in tika-config.xml:
> <properties>
>   <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>     <params>
>       <param name="byteArrayMaxOverride" type="int">7</param>
>     </params>
>   </parser>
> </properties>
>  
> I've also verified that the tika-config.xml is being picked-up by Tika on 
> startup:
>   org.apache.tika.server.core.TikaServerProcess Using custom config: 
> /tika-config.xml
>  
> However, when I encounter a very large docx file, I can clearly see that the 
> configuration in tika-config is not being picked-up:
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
> array of length 686,679,089, but the maximum length for this record type is 
> 100,000,000.
> If the file is not corrupt and not large, please open an issue on bugzilla to 
> request 
> increasing the maximum allowable size for this record type.
> You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>  
> I understand that this is a very large docx file. However, we can handle this 
> amount of text extraction and are fine with the time it takes for Tika to 
> perform this extraction and the amount of memory required to complete this 
> extraction. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3880) Tika not picking-up setByteArrayMaxOverride from tika-config

2022-10-14 Thread Ethan Wilansky (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17617934#comment-17617934
 ] 

Ethan Wilansky commented on TIKA-3880:
--

Hi Tim,

I see I couldn't add a comment directly to the page, so I'm mentioning it here. 
On this page: 
[https://cwiki.apache.org/confluence/display/TIKA/ModifyingContentWithHandlersAndMetadataFilters],
under 4. AutoDetectParserConfig, there is a mismatch on an element name:
{{<maximumCompressionRation>100}}
 => 
{{<maximumCompressionRatio>100}}
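
In context, the corrected element sits in the autoDetectParserConfig block; a 
sketch, with the sibling param and both values merely illustrative:
{code:xml}
<autoDetectParserConfig>
  <params>
    <!-- spool incoming streams above this many bytes to a temp file -->
    <spoolToDisk>100000</spoolToDisk>
    <!-- corrected element name (no trailing "n") -->
    <maximumCompressionRatio>100</maximumCompressionRatio>
  </params>
</autoDetectParserConfig>
{code}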

> Tika not picking-up setByteArrayMaxOverride from tika-config
> 
>
> Key: TIKA-3880
> URL: https://issues.apache.org/jira/browse/TIKA-3880
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: We are running this through docker on a machine with 
> plenty of memory resources allocated to Docker.
> Docker config: 32 GB, 8 processors
> Host machine: 64 GB, 32 processors
> Our docker-compose configuration is derived from: 
> [https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]
> We are experienced with Docker and are confident that the issue isn't with 
> Docker.
>  
>Reporter: Ethan Wilansky
>Priority: Blocker
>
> I have specified this parser parameter in tika-config.xml:
> <properties>
>   <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>     <params>
>       <param name="byteArrayMaxOverride" type="int">7</param>
>     </params>
>   </parser>
> </properties>
>  
> I've also verified that the tika-config.xml is being picked-up by Tika on 
> startup:
>   org.apache.tika.server.core.TikaServerProcess Using custom config: 
> /tika-config.xml
>  
> However, when I encounter a very large docx file, I can clearly see that the 
> configuration in tika-config is not being picked-up:
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
> array of length 686,679,089, but the maximum length for this record type is 
> 100,000,000.
> If the file is not corrupt and not large, please open an issue on bugzilla to 
> request 
> increasing the maximum allowable size for this record type.
> You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>  
> I understand that this is a very large docx file. However, we can handle this 
> amount of text extraction and are fine with the time it takes for Tika to 
> perform this extraction and the amount of memory required to complete this 
> extraction. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3880) Tika not picking-up setByteArrayMaxOverride from tika-config

2022-10-14 Thread Ethan Wilansky (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17617902#comment-17617902
 ] 

Ethan Wilansky commented on TIKA-3880:
--

Thanks for the reference about large file processing, much appreciated!

About my comment, "...text extraction of large files, up to 50 MB file size in 
our case": yes, we do know the file size before sending a file to Tika, and we 
check it before sending the file for extraction. My point was that we are 
sending large files for extraction and want to configure Tika to parse these 
large files regardless of file type. We are careful not to send files to Tika 
that can't be parsed. We use the Tika detect endpoint to verify the mimetype 
before sending files for text extraction. About your last comment, I will be 
sure to set the default parser element in the config. Thanks again for your 
responses, Tim.
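
That detect pre-check is a single request; a sketch, assuming tika-server on 
the default port 9998:
{code:bash}
# Returns the detected mimetype, e.g.
# application/vnd.openxmlformats-officedocument.wordprocessingml.document
curl -T ./8mb.docx http://localhost:9998/detect/stream
{code}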

> Tika not picking-up setByteArrayMaxOverride from tika-config
> 
>
> Key: TIKA-3880
> URL: https://issues.apache.org/jira/browse/TIKA-3880
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: We are running this through docker on a machine with 
> plenty of memory resources allocated to Docker.
> Docker config: 32 GB, 8 processors
> Host machine: 64 GB, 32 processors
> Our docker-compose configuration is derived from: 
> [https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]
> We are experienced with Docker and are confident that the issue isn't with 
> Docker.
>  
>Reporter: Ethan Wilansky
>Priority: Blocker
>
> I have specified this parser parameter in tika-config.xml:
> <properties>
>   <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>     <params>
>       <param name="byteArrayMaxOverride" type="int">7</param>
>     </params>
>   </parser>
> </properties>
>  
> I've also verified that the tika-config.xml is being picked-up by Tika on 
> startup:
>   org.apache.tika.server.core.TikaServerProcess Using custom config: 
> /tika-config.xml
>  
> However, when I encounter a very large docx file, I can clearly see that the 
> configuration in tika-config is not being picked-up:
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
> array of length 686,679,089, but the maximum length for this record type is 
> 100,000,000.
> If the file is not corrupt and not large, please open an issue on bugzilla to 
> request 
> increasing the maximum allowable size for this record type.
> You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>  
> I understand that this is a very large docx file. However, we can handle this 
> amount of text extraction and are fine with the time it takes for Tika to 
> perform this extraction and the amount of memory required to complete this 
> extraction. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3880) Tika not picking-up setByteArrayMaxOverride from tika-config

2022-10-14 Thread Ethan Wilansky (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17617879#comment-17617879
 ] 

Ethan Wilansky commented on TIKA-3880:
--

I'll try the config you posted.

> Tika not picking-up setByteArrayMaxOverride from tika-config
> 
>
> Key: TIKA-3880
> URL: https://issues.apache.org/jira/browse/TIKA-3880
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: We are running this through docker on a machine with 
> plenty of memory resources allocated to Docker.
> Docker config: 32 GB, 8 processors
> Host machine: 64 GB, 32 processors
> Our docker-compose configuration is derived from: 
> [https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]
> We are experienced with Docker and are confident that the issue isn't with 
> Docker.
>  
>Reporter: Ethan Wilansky
>Priority: Blocker
>
> I have specified this parser parameter in tika-config.xml:
> <properties>
>   <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>     <params>
>       <param name="byteArrayMaxOverride" type="int">7</param>
>     </params>
>   </parser>
> </properties>
>  
> I've also verified that the tika-config.xml is being picked-up by Tika on 
> startup:
>   org.apache.tika.server.core.TikaServerProcess Using custom config: 
> /tika-config.xml
>  
> However, when I encounter a very large docx file, I can clearly see that the 
> configuration in tika-config is not being picked-up:
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
> array of length 686,679,089, but the maximum length for this record type is 
> 100,000,000.
> If the file is not corrupt and not large, please open an issue on bugzilla to 
> request 
> increasing the maximum allowable size for this record type.
> You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>  
> I understand that this is a very large docx file. However, we can handle this 
> amount of text extraction and are fine with the time it takes for Tika to 
> perform this extraction and the amount of memory required to complete this 
> extraction. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-3880) Tika not picking-up setByteArrayMaxOverride from tika-config

2022-10-14 Thread Ethan Wilansky (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17617878#comment-17617878
 ] 

Ethan Wilansky edited comment on TIKA-3880 at 10/14/22 4:57 PM:


Hi Tim,

Good catch! The parser was not wrapped in a parsers element, and it does appear 
to be working now. However, regarding your question about default parsers: I 
didn't specify the default parser element in tika config. My understanding 
(possibly wrong) is that for all other file types, the default parser would be 
used. Considering the work we are doing, if we are dealing with a file type that 
has an associated tika parser, we want to allow for text extraction of large 
files, up to 50 MB file size in our case. Is there a way to set this globally? 
Would this be the way to do it?

{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">{> default value}</param>
      </params>
    </parser>
  </parsers>
</properties>
{code}

In case you want to take a closer look, here's the call stack for processing 
the docx before I had byteArrayMaxOverride properly set:

INFO  [qtp2027701910-29] 15:33:15,945 
org.apache.tika.server.core.resource.DetectorResource Detecting media type for 
Filename: file.docx
INFO  [qtp2027701910-27] 15:33:16,979 
org.apache.tika.server.core.resource.TikaResource /tika 
(application/vnd.openxmlformats-officedocument.wordprocessingml.document)
WARN  [qtp2027701910-27] 15:33:16,995 
org.apache.tika.server.core.resource.TikaResource tika: Text extraction failed 
(null)
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4b23d67b
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:175) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 

[jira] [Commented] (TIKA-3880) Tika not picking-up setByteArrayMaxOverride from tika-config

2022-10-14 Thread Ethan Wilansky (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17617878#comment-17617878
 ] 

Ethan Wilansky commented on TIKA-3880:
--

Hi Tim,

Good catch! The parser was not wrapped in a parsers element, and it does appear 
to be working now. However, regarding your question about default parsers: I 
didn't specify the default parser element in tika config. My understanding 
(possibly wrong) is that for all other file types, the default parser would be 
used. Considering the work we are doing, if we are dealing with a file type that 
has an associated tika parser, we want to allow for text extraction of large 
files, up to 50 MB file size in our case. Is there a way to set this globally? 
Would this be the way to do it?

{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">{> default value}</param>
      </params>
    </parser>
  </parsers>
</properties>
{code}

In case you want to take a closer look, here's the call stack for processing 
the docx before I had byteArrayMaxOverride properly set:

INFO  [qtp2027701910-29] 15:33:15,945 
org.apache.tika.server.core.resource.DetectorResource Detecting media type for 
Filename: file.docx
INFO  [qtp2027701910-27] 15:33:16,979 
org.apache.tika.server.core.resource.TikaResource /tika 
(application/vnd.openxmlformats-officedocument.wordprocessingml.document)
WARN  [qtp2027701910-27] 15:33:16,995 
org.apache.tika.server.core.resource.TikaResource tika: Text extraction failed 
(null)
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4b23d67b
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:175) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:502)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1616) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) 
~[tika-server-standard-2.5.0.jar:2.5.0]
    at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
 ~[tika-server-standard-2.5.0.jar:2.5.0]
    at 

[jira] [Created] (TIKA-3880) Tika not picking-up setByteArrayMaxOverride from tika-config

2022-10-14 Thread Ethan Wilansky (Jira)
Ethan Wilansky created TIKA-3880:


 Summary: Tika not picking-up setByteArrayMaxOverride from 
tika-config
 Key: TIKA-3880
 URL: https://issues.apache.org/jira/browse/TIKA-3880
 Project: Tika
  Issue Type: Improvement
  Components: app
Affects Versions: 2.5.0
 Environment: We are running this through docker on a machine with 
plenty of memory resources allocated to Docker.

Docker config: 32 GB, 8 processors
Host machine: 64 GB, 32 processors

Our docker-compose configuration is derived from: 
[https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]

We are experienced with Docker and are confident that the issue isn't with 
Docker.

 
Reporter: Ethan Wilansky


I have specified this parser parameter in tika-config.xml:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parser class="org.apache.tika.parser.microsoft.OfficeParser">
    <params>
      <param name="byteArrayMaxOverride" type="int">7</param>
    </params>
  </parser>
</properties>
 
I've also verified that the tika-config.xml is being picked-up by Tika on 
startup:
  org.apache.tika.server.core.TikaServerProcess Using custom config: 
/tika-config.xml
 
However, when I encounter a very large docx file, I can clearly see that the 
configuration in tika-config is not being picked-up:
 
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
array of length 686,679,089, but the maximum length for this record type is 
100,000,000.
If the file is not corrupt and not large, please open an issue on bugzilla to 
request 
increasing the maximum allowable size for this record type.
You can set a higher override value with IOUtils.setByteArrayMaxOverride()
 
I understand that this is a very large docx file. However, we can handle this 
amount of text extraction and are fine with the time it takes for Tika to 
perform this extraction and the amount of memory required to complete this 
extraction. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)