[jira] [Resolved] (TIKA-3885) Move AsyncProcessor's main to a new module
[ https://issues.apache.org/jira/browse/TIKA-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3885. --- Fix Version/s: 2.5.1 Resolution: Fixed > Move AsyncProcessor's main to a new module > -- > > Key: TIKA-3885 > URL: https://issues.apache.org/jira/browse/TIKA-3885 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Major > Fix For: 2.5.1 > > > On TIKA-3876, we added a main() to AsyncProcessor. It would be helpful to > move this functionality out of AsyncProcessor/tika-core into its own module. > That'll allow us to package logging etc and tika-core in a new module. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3886) Inject PDF annotation type into embedded files' metadata
Tim Allison created TIKA-3886: - Summary: Inject PDF annotation type into embedded files' metadata Key: TIKA-3886 URL: https://issues.apache.org/jira/browse/TIKA-3886 Project: Tika Issue Type: Improvement Reporter: Tim Allison In PDFs, embedded files may appear in annotations with different types, e.g. 3D. It would be helpful to associate the annotation types with the embedded files by adding a metadata item to the embedded file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3885) Move AsyncProcessor's main to a new module
[ https://issues.apache.org/jira/browse/TIKA-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620449#comment-17620449 ] Hudson commented on TIKA-3885: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #854 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/854/]) TIKA-3885: Add a tika-async-cli module (tallison: [https://github.com/apache/tika/commit/2b9ba8612b20d2779863f08b908e89cc001b483f]) * (edit) tika-bom/pom.xml * (edit) tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java * (add) tika-pipes/tika-async-cli/src/main/java/org/apache/tika/async/cli/TikaAsyncCLI.java * (edit) CHANGES.txt * (edit) tika-app/pom.xml * (add) tika-pipes/tika-async-cli/pom.xml * (edit) pom.xml * (edit) tika-core/src/main/java/org/apache/tika/pipes/async/AsyncProcessor.java * (edit) tika-pipes/pom.xml > Move AsyncProcessor's main to a new module > -- > > Key: TIKA-3885 > URL: https://issues.apache.org/jira/browse/TIKA-3885 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Major > Fix For: 2.5.1 > > > On TIKA-3876, we added a main() to AsyncProcessor. It would be helpful to > move this functionality out of AsyncProcessor/tika-core into its own module. > That'll allow us to package logging etc and tika-core in a new module. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3887) Store PDActions and triggers in file's metadata
Tim Allison created TIKA-3887: - Summary: Store PDActions and triggers in file's metadata Key: TIKA-3887 URL: https://issues.apache.org/jira/browse/TIKA-3887 Project: Tika Issue Type: Improvement Reporter: Tim Allison -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3888) to for checkstyle configs
Tim Allison created TIKA-3888: - Summary: to for checkstyle configs Key: TIKA-3888 URL: https://issues.apache.org/jira/browse/TIKA-3888 Project: Tika Issue Type: Task Reporter: Tim Allison -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3887) Store PDActions and triggers in file's metadata
[ https://issues.apache.org/jira/browse/TIKA-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3887. --- Fix Version/s: 2.5.1 Resolution: Fixed > Store PDActions and triggers in file's metadata > --- > > Key: TIKA-3887 > URL: https://issues.apache.org/jira/browse/TIKA-3887 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 2.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3888) to for checkstyle configs
[ https://issues.apache.org/jira/browse/TIKA-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3888. --- Fix Version/s: 2.5.1 Resolution: Fixed > to for checkstyle configs > > > Key: TIKA-3888 > URL: https://issues.apache.org/jira/browse/TIKA-3888 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 2.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3889) Include counts of 3d objects in addition to current boolean has3D in PDFs
Tim Allison created TIKA-3889: - Summary: Include counts of 3d objects in addition to current boolean has3D in PDFs Key: TIKA-3889 URL: https://issues.apache.org/jira/browse/TIKA-3889 Project: Tika Issue Type: Task Reporter: Tim Allison -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3889) Include counts of 3d objects in addition to current boolean has3D in PDFs
[ https://issues.apache.org/jira/browse/TIKA-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3889. --- Fix Version/s: 2.5.1 Resolution: Fixed > Include counts of 3d objects in addition to current boolean has3D in PDFs > - > > Key: TIKA-3889 > URL: https://issues.apache.org/jira/browse/TIKA-3889 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 2.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3888) to for checkstyle configs
[ https://issues.apache.org/jira/browse/TIKA-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620577#comment-17620577 ] Hudson commented on TIKA-3888: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #856 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/856/]) TIKA-3888 (tallison: [https://github.com/apache/tika/commit/54e2e77e3f5672cd77372716e78a2f237d1e3043]) * (edit) tika-parsers/pom.xml * (edit) tika-example/pom.xml * (edit) tika-serialization/pom.xml * (edit) tika-server/pom.xml * (edit) tika-langdetect/pom.xml * (edit) tika-batch/pom.xml * (edit) tika-pipes/pom.xml * (edit) tika-eval/pom.xml * (edit) tika-fuzzing/pom.xml * (edit) tika-core/pom.xml > to for checkstyle configs > > > Key: TIKA-3888 > URL: https://issues.apache.org/jira/browse/TIKA-3888 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 2.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3889) Include counts of 3d objects in addition to current boolean has3D in PDFs
[ https://issues.apache.org/jira/browse/TIKA-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620578#comment-17620578 ] Hudson commented on TIKA-3889: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #856 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/856/]) TIKA-3889 -- include counts of 3d objects (tallison: [https://github.com/apache/tika/commit/f6264c7044148f98dd733b9194a92918bb36bea7]) * (edit) tika-core/src/main/java/org/apache/tika/metadata/PDF.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java > Include counts of 3d objects in addition to current boolean has3D in PDFs > - > > Key: TIKA-3889 > URL: https://issues.apache.org/jira/browse/TIKA-3889 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 2.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3886) Inject PDF annotation type into embedded files' metadata
[ https://issues.apache.org/jira/browse/TIKA-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620579#comment-17620579 ] Hudson commented on TIKA-3886: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #856 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/856/]) TIKA-3886 -- extract annotationtype for embedded files in PDFs (tallison: [https://github.com/apache/tika/commit/fd474b6541cb397e9a1db4965b1725b1d9b5e241]) * (edit) tika-core/src/main/java/org/apache/tika/metadata/PDF.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java TIKA-3886 -- Extract PDF actions and triggers into the file's metadata (tallison: [https://github.com/apache/tika/commit/5062690cb18be20a6bde5b5e5e55755586c79ee2]) * (edit) tika-core/src/main/java/org/apache/tika/metadata/PDF.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java * (edit) CHANGES.txt > Inject PDF annotation type into embedded files' metadata > > > Key: TIKA-3886 > URL: https://issues.apache.org/jira/browse/TIKA-3886 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Major > > In PDFs, embedded files may appear in annotations with different types, e.g. > 3D. It would be helpful to associate the annotation types with the embedded > files by adding a metadata item to the embedded file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3887) Store PDActions and triggers in file's metadata
[ https://issues.apache.org/jira/browse/TIKA-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620580#comment-17620580 ] Hudson commented on TIKA-3887: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #856 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/856/]) TIKA-3887 -- Extract PDF actions and triggers into the file's metadata -- fix CHANGES.txt (tallison: [https://github.com/apache/tika/commit/291a74147c6999c28c1b34b32a7b925eb1104ee6]) * (edit) CHANGES.txt > Store PDActions and triggers in file's metadata > --- > > Key: TIKA-3887 > URL: https://issues.apache.org/jira/browse/TIKA-3887 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 2.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction
Ethan Wilansky created TIKA-3890: Summary: Identifying an efficient approach for getting page count prior to running an extraction Key: TIKA-3890 URL: https://issues.apache.org/jira/browse/TIKA-3890 Project: Tika Issue Type: Improvement Components: app Affects Versions: 2.5.0 Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores Docker container with 5.5GB reserved memory, 6GB limit Tika config w/ 2GB reserved memory, 5GB limit Reporter: Ethan Wilansky Tika is doing a great job with text extraction, until we encounter an Office document with an unreasonably large number of pages of extractable text, for example a Word document containing thousands of text pages. Unfortunately, we don't have an efficient way to determine page count before calling the /tika or /rmeta endpoints: we either get back a record-size error, or we set byteArrayMaxOverride to a large number to return the text or metadata containing the page count. In both cases, it can take significant time to return a result. For example, this call: {{curl -T ./8mb.docx -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" http://localhost:9998/rmeta/ignore}} {quote}with the configuration (the tika-config XML markup was stripped by the mail archive; the surviving values include the TesseractOCRParser and OfficeParser parser entries, the settings 17500 and 12, and the JVM args -Xms2000m and -Xmx5000m){quote} returns {{"xmpTPg:NPages":"14625"}} in ~53 seconds. Yes, I know this is a huge docx file and I don't want to process it. If I don't configure {{byteArrayMaxOverride}}, I get this exception in just over a second: {{Tried to allocate an array of length 172,983,026, but the maximum length for this record type is 100,000,000.}} The exception is the preferred result. With that in mind, can you answer these questions? 1. Will other extractable file types that don't use the OfficeParser also throw the same array-allocation error for very large text extractions? 2. Is there any way to correlate the array length returned to the number of lines or pages in the associated file to parse? 3. Is there an efficient way to calculate lines or pages of extractable content in a file before sending it for extraction? It doesn't appear that /rmeta with the /ignore path param significantly improves efficiency over calling the /tika endpoint or /rmeta without /ignore. If it's useful, I can share the 8MB docx file containing 14k pages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
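On question 3, one cheap pre-check is worth noting: a .docx is an OOXML ZIP container, and Word usually records document statistics, including a page count, in the docProps/app.xml entry, which can be read without parsing the document at all. The sketch below (a self-contained illustration using Python's standard zipfile module, not a Tika API) shows the idea; the count is whatever Word wrote on its last save, so it may be missing or stale.

```python
import re
import zipfile

def docx_page_count(docx):
    """Return Word's <Pages> statistic from docProps/app.xml, or None.

    Only the small app-properties ZIP entry is decompressed, so this is
    cheap even for multi-thousand-page files. The value is recorded by
    Word when it last saved the file, so it may be absent or stale.
    """
    with zipfile.ZipFile(docx) as z:
        try:
            xml = z.read("docProps/app.xml").decode("utf-8")
        except KeyError:  # entry not present in this container
            return None
    m = re.search(r"<Pages>(\d+)</Pages>", xml)
    return int(m.group(1)) if m else None
```

If the statistic is absent or untrustworthy, the only reliable count still requires rendering the document, as noted in the comments on this issue.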
[jira] [Updated] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Wilansky updated TIKA-3890: - Description: Tika is doing a great job with text extraction, until we encounter an Office document with an unreasonably large number of pages of extractable text, for example a Word document containing thousands of text pages. Unfortunately, we don't have an efficient way to determine page count before calling the /tika or /rmeta endpoints: we either get back an array allocation error, or we set byteArrayMaxOverride to a large number to return the text or metadata containing the page count. Returning a result other than the array allocation error can take significant time. For example, this call: {{curl -T ./8mb.docx -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" http://localhost:9998/rmeta/ignore}} {quote}with the configuration (the tika-config XML markup was stripped by the mail archive; the surviving values include the TesseractOCRParser and OfficeParser parser entries, the settings 17500 and 12, and the JVM args -Xms2000m and -Xmx5000m){quote} returns {{"xmpTPg:NPages":"14625"}} in ~53 seconds. Yes, I know this is a huge docx file and I don't want to process it. If I don't configure {{byteArrayMaxOverride}}, I get this exception in just over a second: {{Tried to allocate an array of length 172,983,026, but the maximum length for this record type is 100,000,000.}} The exception is the preferred result. With that in mind, can you answer these questions? 1. Will other extractable file types that don't use the OfficeParser also throw the same array allocation error for very large text extractions? 2. Is there any way to correlate the array length returned to the number of lines or pages in the associated file to parse? 3. Is there an efficient way to calculate lines or pages of extractable content in a file before sending it for extraction? It doesn't appear that /rmeta with the /ignore path param significantly improves efficiency over calling the /tika endpoint or /rmeta without /ignore. If it's useful, I can share the 8MB docx file containing 14k pages. > Identifying an efficient approach for getting page count prior to running an > extraction > --- > > Key: TIKA-3890 > URL: https://issues.apache.org/jira/browse/TIKA-3890 > Project: Tika > Issue Type: Improvement > Components: app > Affects Versions: 2.5.0 > Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores > Docker container with 5.5GB reserved memory, 6GB limit > Tika config w/ 2GB reserved memory, 5GB limit > Reporter: Ethan Wilansky > Priority: Blocker -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620610#comment-17620610 ] Nick Burch commented on TIKA-3890: -- The only way to be sure of how many pages are in a Word document is to render it (to screen / PDF / printer). Some Word files get lucky and have a sensible number in the metadata, set by Word from when it last opened the file and felt like populating statistics, but that's by no means always the case. If you're fairly sure your documents have sensible metadata, you could always pre-process with Apache POI. If you provide a File object and only read the metadata streams, it's pretty memory-efficient to query. > Identifying an efficient approach for getting page count prior to running an > extraction > --- > > Key: TIKA-3890 > URL: https://issues.apache.org/jira/browse/TIKA-3890 > Project: Tika > Issue Type: Improvement > Components: app > Affects Versions: 2.5.0 > Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores > Docker container with 5.5GB reserved memory, 6GB limit > Tika config w/ 2GB reserved memory, 5GB limit > Reporter: Ethan Wilansky > Priority: Blocker -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620630#comment-17620630 ] Ethan Wilansky commented on TIKA-3890: -- Aha, I'll have to give Apache POI a try. Thanks, Nick. It would also be useful to get an estimate of the extracted file's size. For example, the 8MB docx file generated a 31MB text file. Is there a way in Tika to estimate extraction size beforehand? > Identifying an efficient approach for getting page count prior to running an > extraction > --- > > Key: TIKA-3890 > URL: https://issues.apache.org/jira/browse/TIKA-3890 > Project: Tika > Issue Type: Improvement > Components: app > Affects Versions: 2.5.0 > Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores > Docker container with 5.5GB reserved memory, 6GB limit > Tika config w/ 2GB reserved memory, 5GB limit > Reporter: Ethan Wilansky > Priority: Blocker -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620633#comment-17620633 ] Nick Burch commented on TIKA-3890: -- DOCX files are compressed XML. Text compresses very well; already-compressed images, audio, and video don't. An 8MB Word document of pure text could fairly easily produce 10x that in text, while an 8MB Word document that's mostly images could produce just a few bytes of text. DOCX-specific: you could open the file in POI (use a File to save memory) and check the size of the word XML stream and the size of any attachments; that'd give you a vague idea. However, it won't give you a complete answer, as the word XML could have loads of complex stuff in it that doesn't end up in the text output... Easiest way to know the size of the output is just to parse it on a beefy machine with suitable restarts / respawning in place, and see what you get! > Identifying an efficient approach for getting page count prior to running an > extraction > --- > > Key: TIKA-3890 > URL: https://issues.apache.org/jira/browse/TIKA-3890 > Project: Tika > Issue Type: Improvement > Components: app > Affects Versions: 2.5.0 > Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores > Docker container with 5.5GB reserved memory, 6GB limit > Tika config w/ 2GB reserved memory, 5GB limit > Reporter: Ethan Wilansky > Priority: Blocker -- This message was sent by Atlassian Jira (v8.20.10#820010)
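Nick's DOCX-specific size check can be approximated without POI, since a .docx is a ZIP whose central directory already records each entry's uncompressed size. The sketch below (a self-contained illustration using Python's standard zipfile module, not Tika's or POI's API) compares the main word/document.xml part against embedded media; as cautioned in the comment above, the XML size is only a vague ceiling on text output, because markup that produces no text is counted too.

```python
import zipfile

def docx_size_profile(docx):
    """Rough profile of a .docx without parsing it: uncompressed bytes of
    the main document XML vs. embedded media. A text-heavy file shows a
    large XML part and little media; an image-heavy file, the reverse."""
    doc_xml = 0
    media = 0
    with zipfile.ZipFile(docx) as z:
        for info in z.infolist():
            if info.filename == "word/document.xml":
                doc_xml = info.file_size  # uncompressed size from the central directory
            elif info.filename.startswith("word/media/"):
                media += info.file_size
    return {"document_xml_bytes": doc_xml, "media_bytes": media}
```

Only the ZIP central directory is read here, so the cost is independent of document length; actually knowing the output size still requires parsing, as Nick says.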
[GitHub] [tika] dependabot[bot] opened a new pull request, #754: Bump icu4j from 62.2 to 72.1
dependabot[bot] opened a new pull request, #754: URL: https://github.com/apache/tika/pull/754 Bumps [icu4j](https://github.com/unicode-org/icu) from 62.2 to 72.1. Release notes, sourced from icu4j's releases (https://github.com/unicode-org/icu/releases): ICU 72.1 We are pleased to announce the release of Unicode® ICU 72. It updates to Unicode 15 (https://blog.unicode.org/2022/09/announcing-unicode-standard-version-150.html) and to CLDR 42 (https://cldr.unicode.org/index/downloads/cldr-42) locale data with various additions and corrections. ICU 72 and CLDR 42 are major releases, including a new version of Unicode and major locale data improvements. ICU 72 adds two technology preview implementations based on draft Unicode specifications: formatting of people's names in multiple languages (see https://cldr.unicode.org/index/downloads/cldr-42#h.nrv6xq99qe7d for CLDR background on why this feature is being added and what it does), and an enhanced version of message formatting. This release also updates to the time zone data version 2022e (2022-oct). Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release (https://www.iana.org/time-zones) since 2021b. For details, please see https://icu.unicode.org/download/72. Note: The prebuilt WinARM64 binaries below should be considered alpha/experimental. ICU 72rc with CLDR beta3 / tzdata2022d: https://icu.unicode.org/download/72 ICU 72 RC We are pleased to announce the release candidate for Unicode® ICU 72. It updates to Unicode 15 and to CLDR 42 locale data with various additions and corrections. ICU 72 adds technology preview implementations for person name formatting, as well as for a new version of message formatting based on a proposed draft Unicode specification.
ICU 72 and CLDR 42 are major releases, including a new version of Unicode and major locale data improvements. ICU 72 updates to the time zone data version 2022b (2022-aug), which is effectively the same as 2022c. Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release (https://www.iana.org/time-zones) since 2021b. For details, please see https://icu.unicode.org/download/72. Please test this release candidate on your platforms and report bugs and regressions by Tuesday, 2022-oct-18, via the icu-support mailing list (https://icu.unicode.org/contacts), and/or please find/submit error reports (https://icu.unicode.org/bugs). Please do not use this release candidate in production. The preliminary API reference documents are published on https://unicode-org.github.io/icu-docs/; follow the “Dev” links there. ICU 71.1 We are pleased to announce the release of Unicode® ICU 71. ICU 71 updates to CLDR 41 (https://cldr.unicode.org/index/downloads/cldr-41) locale data with various additions and corrections. ICU 71 adds phrase-based line breaking for Japanese. Existing line breaking methods follow standards and conventions for body text but do not work well for short Japanese text, such as in titles and headings. This new feature is optimized for these use cases. ICU 71 adds support for Hindi written in Latin letters (hi_Latn). The CLDR data for this increasingly popular locale has been significantly revised and expanded. Note that, based on user expectations, hi_Latn incorporates a large amount of English and can also be referred to as “Hinglish”. ICU 71 and CLDR 41 are minor releases, mostly focused on bug fixes and small enhancements. (The fall CLDR/ICU releases will update to Unicode 15, which is planned for September.) We are also working to re-establish continuous performance testing for ICU, and on development towards future versions.
ICU 71 updates to time zone data version 2022a. Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release (https://www.iana.org/time-zones) since 2021b. For details, please see https://icu.unicode.org/download/71.

... (truncated)

Commits

See full diff in compare view: https://github.com/unicode-org/icu/commits

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=com.ibm.icu:icu4j&package-manager=maven&previous-version=62.2&new-version=72.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself.
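For context, a bump like this amounts to a one-line version update in the Maven build. A minimal sketch of the dependency declaration, using the coordinates reported in the compatibility-score URL above (`com.ibm.icu:icu4j`); where exactly this is declared in Tika's poms may differ:

```xml
<!-- icu4j coordinates as reported by dependabot; location in Tika's poms may vary -->
<dependency>
  <groupId>com.ibm.icu</groupId>
  <artifactId>icu4j</artifactId>
  <version>72.1</version> <!-- previously 62.2 -->
</dependency>
```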
[GitHub] [tika] dependabot[bot] opened a new pull request, #755: Bump twelvemonkeys.version from 3.9.0 to 3.9.1
dependabot[bot] opened a new pull request, #755: URL: https://github.com/apache/tika/pull/755

Bumps `twelvemonkeys.version` from 3.9.0 to 3.9.1.

Updates `common-io` from 3.9.0 to 3.9.1
Updates `imageio-bmp` from 3.9.0 to 3.9.1
Updates `imageio-jpeg` from 3.9.0 to 3.9.1
Updates `imageio-psd` from 3.9.0 to 3.9.1
Updates `imageio-tiff` from 3.9.0 to 3.9.1

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [tika] dependabot[bot] opened a new pull request, #756: Bump aws.version from 1.12.323 to 1.12.324
dependabot[bot] opened a new pull request, #756: URL: https://github.com/apache/tika/pull/756

Bumps `aws.version` from 1.12.323 to 1.12.324.

Updates `aws-java-sdk-transcribe` from 1.12.323 to 1.12.324

Changelog sourced from aws-java-sdk-transcribe's changelog (https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md):

1.12.324 (2022-10-19)
- AWS CloudTrail: This release includes support for exporting CloudTrail Lake query results to an Amazon S3 bucket.
- AWS Config: This release adds resourceType enums for AppConfig, AppSync, DataSync, EC2, EKS, Glue, GuardDuty, SageMaker, ServiceDiscovery, SES, Route53 types.
- AWS S3 Control: Updates internal logic for constructing API endpoints. We have added rule-based endpoints and internal model parameters.
- AWS Support App: This release adds the RegisterSlackWorkspaceForOrganization API. You can use the API to register a Slack workspace for an AWS account that is part of an organization.
- Amazon Chime SDK Messaging: Documentation updates for Chime Messaging SDK.
- Amazon Connect Service: This release adds API support for managing phone numbers that can be used across multiple AWS regions through telephony traffic distribution.
- Amazon EventBridge: Updates internal logic for constructing API endpoints. We have added rule-based endpoints and internal model parameters.
- Amazon Managed Blockchain: Adding new Accessor APIs for Amazon Managed Blockchain.
- Amazon Simple Storage Service: Updates internal logic for constructing API endpoints. We have added rule-based endpoints and internal model parameters.
- Amazon WorkSpaces Web: WorkSpaces Web now supports user access logging for recording session start, stop, and URL navigation.
Commits
- cbeaf1a AWS SDK for Java 1.12.324 (https://github.com/aws/aws-sdk-java/commit/cbeaf1a74c54961982396782c05590862e5fef77)
- f587274 Update GitHub version number to 1.12.324-SNAPSHOT (https://github.com/aws/aws-sdk-java/commit/f5872749a14b8637612e3722beb07a4d8eb83084)

See full diff in compare view: https://github.com/aws/aws-sdk-java/compare/1.12.323...1.12.324

Updates `aws-java-sdk-s3` from 1.12.323 to 1.12.324. Its 1.12.324 changelog entry and commit list are identical to those quoted above for aws-java-sdk-transcribe.

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself.
[GitHub] [tika] THausherr merged pull request #756: Bump aws.version from 1.12.323 to 1.12.324
THausherr merged PR #756: URL: https://github.com/apache/tika/pull/756
[GitHub] [tika] THausherr merged pull request #755: Bump twelvemonkeys.version from 3.9.0 to 3.9.1
THausherr merged PR #755: URL: https://github.com/apache/tika/pull/755
[GitHub] [tika] THausherr closed pull request #754: Bump icu4j from 62.2 to 72.1
THausherr closed pull request #754: Bump icu4j from 62.2 to 72.1 URL: https://github.com/apache/tika/pull/754
[GitHub] [tika] dependabot[bot] commented on pull request #754: Bump icu4j from 62.2 to 72.1
dependabot[bot] commented on PR #754: URL: https://github.com/apache/tika/pull/754#issuecomment-1285000675

OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. You can also ignore all major, minor, or patch releases for a dependency by adding an [`ignore` condition](https://docs.github.com/en/code-security/supply-chain-security/configuration-options-for-dependency-updates#ignore) with the desired `update_types` to your config file. If you change your mind, just re-open this PR and I'll resolve any conflicts on it.