[GitHub] [tika] dependabot[bot] opened a new pull request, #756: Bump aws.version from 1.12.323 to 1.12.324

2022-10-19 Thread GitBox


dependabot[bot] opened a new pull request, #756:
URL: https://github.com/apache/tika/pull/756

   Bumps `aws.version` from 1.12.323 to 1.12.324.
   Updates `aws-java-sdk-transcribe` from 1.12.323 to 1.12.324
   
   Changelog
   Sourced from aws-java-sdk-transcribe's changelog (https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).
   
   1.12.324 (2022-10-19)

   - AWS CloudTrail (Features): This release includes support for exporting 
   CloudTrail Lake query results to an Amazon S3 bucket.
   - AWS Config (Features): This release adds resourceType enums for AppConfig, 
   AppSync, DataSync, EC2, EKS, Glue, GuardDuty, SageMaker, ServiceDiscovery, 
   SES, Route53 types.
   - AWS S3 Control (Features): Updates internal logic for constructing API 
   endpoints. We have added rule-based endpoints and internal model parameters.
   - AWS Support App (Features): This release adds the 
   RegisterSlackWorkspaceForOrganization API. You can use the API to register a 
   Slack workspace for an AWS account that is part of an organization.
   - Amazon Chime SDK Messaging (Features): Documentation updates for Chime 
   Messaging SDK.
   - Amazon Connect Service (Features): This release adds API support for 
   managing phone numbers that can be used across multiple AWS regions through 
   telephony traffic distribution.
   - Amazon EventBridge (Features): Updates internal logic for constructing API 
   endpoints. We have added rule-based endpoints and internal model parameters.
   - Amazon Managed Blockchain (Features): Adding new Accessor APIs for Amazon 
   Managed Blockchain.
   - Amazon Simple Storage Service (Features): Updates internal logic for 
   constructing API endpoints. We have added rule-based endpoints and internal 
   model parameters.
   - Amazon WorkSpaces Web (Features): WorkSpaces Web now supports user access 
   logging for recording session start, stop, and URL navigation.

   Commits
   - cbeaf1a: AWS SDK for Java 1.12.324 
   (https://github.com/aws/aws-sdk-java/commit/cbeaf1a74c54961982396782c05590862e5fef77)
   - f587274: Update GitHub version number to 1.12.324-SNAPSHOT 
   (https://github.com/aws/aws-sdk-java/commit/f5872749a14b8637612e3722beb07a4d8eb83084)
   - See full diff in compare view: 
   https://github.com/aws/aws-sdk-java/compare/1.12.323...1.12.324

   Updates `aws-java-sdk-s3` from 1.12.323 to 1.12.324

   Changelog
   Sourced from aws-java-sdk-s3's changelog 
   (https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md); the 1.12.324 
   entries and commits are identical to those listed above for 
   aws-java-sdk-transcribe.

   Dependabot will resolve any conflicts with this PR as long as you don't 
   alter it yourself. You can also trigger a rebase manually by commenting 
   `@dependabot rebase`.

[GitHub] [tika] dependabot[bot] opened a new pull request, #755: Bump twelvemonkeys.version from 3.9.0 to 3.9.1

2022-10-19 Thread GitBox


dependabot[bot] opened a new pull request, #755:
URL: https://github.com/apache/tika/pull/755

   Bumps `twelvemonkeys.version` from 3.9.0 to 3.9.1.
   Updates `common-io` from 3.9.0 to 3.9.1
   
   Updates `imageio-bmp` from 3.9.0 to 3.9.1
   
   Updates `imageio-jpeg` from 3.9.0 to 3.9.1
   
   Updates `imageio-psd` from 3.9.0 to 3.9.1
   
   Updates `imageio-tiff` from 3.9.0 to 3.9.1
   
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [tika] dependabot[bot] opened a new pull request, #754: Bump icu4j from 62.2 to 72.1

2022-10-19 Thread GitBox


dependabot[bot] opened a new pull request, #754:
URL: https://github.com/apache/tika/pull/754

   Bumps [icu4j](https://github.com/unicode-org/icu) from 62.2 to 72.1.
   
   Release notes
   Sourced from icu4j's releases (https://github.com/unicode-org/icu/releases).
   
   ICU 72.1
   We are pleased to announce the release of Unicode® ICU 72. It updates to 
   Unicode 15 
   (https://blog.unicode.org/2022/09/announcing-unicode-standard-version-150.html) 
   and to CLDR 42 (https://cldr.unicode.org/index/downloads/cldr-42) locale data 
   with various additions and corrections.
   ICU 72 and CLDR 42 are major releases, including a new version of Unicode 
   and major locale data improvements.
   ICU 72 adds two technology preview implementations based on draft Unicode 
   specifications:

   - Formatting of people's names in multiple languages (see the CLDR 
   background on why this feature is being added and what it does: 
   https://cldr.unicode.org/index/downloads/cldr-42#h.nrv6xq99qe7d)
   - An enhanced version of message formatting

   This release also updates to the time zone data version 2022e (2022-oct). 
   Note that pre-1970 data for a number of time zones has been removed, as has 
   been the case in the upstream tzdata release (https://www.iana.org/time-zones) 
   since 2021b.
   For details, please see https://icu.unicode.org/download/72.
   Note: The prebuilt WinARM64 binaries below should be considered 
   alpha/experimental.
   ICU 72rc with CLDR beta3 / tzdata2022d
   https://icu.unicode.org/download/72
   ICU 72 RC
   We are pleased to announce the release candidate for Unicode® ICU 72. It 
   updates to Unicode 15 
   (https://blog.unicode.org/2022/09/announcing-unicode-standard-version-150.html) 
   and to CLDR 42 (https://cldr.unicode.org/index/downloads/cldr-42) locale data 
   with various additions and corrections.
   ICU 72 adds technology preview implementations for person name 
   formatting, as well as for a new version of message formatting based on a 
   proposed draft Unicode specification.
   ICU 72 and CLDR 42 are major releases, including a new version of Unicode 
   and major locale data improvements.
   ICU 72 updates to the time zone data version 2022b (2022-aug), which is 
   effectively the same as 2022c. Note that pre-1970 data for a number of time 
   zones has been removed, as has been the case in the upstream tzdata release 
   (https://www.iana.org/time-zones) since 2021b.
   For details, please see https://icu.unicode.org/download/72.
   Please test this release candidate on your platforms and report bugs and 
   regressions by Tuesday, 2022-oct-18, via the icu-support mailing list 
   (https://icu.unicode.org/contacts), and/or please find/submit error reports 
   at https://icu.unicode.org/bugs.
   Please do not use this release candidate in production.
   The preliminary API reference documents are published at 
   https://unicode-org.github.io/icu-docs/ (follow the "Dev" links there).
   ICU 71.1
   We are pleased to announce the release of Unicode® ICU 71.
   ICU 71 updates to CLDR 41 (https://cldr.unicode.org/index/downloads/cldr-41) 
   locale data with various additions and corrections.
   ICU 71 adds phrase-based line breaking for Japanese. Existing line 
   breaking methods follow standards and conventions for body text but do not work 
   well for short Japanese text, such as in titles and headings. This new feature 
   is optimized for these use cases.
   ICU 71 adds support for Hindi written in Latin letters (hi_Latn). The CLDR 
   data for this increasingly popular locale has been significantly revised and 
   expanded. Note that based on user expectations, hi_Latn incorporates a large 
   amount of English, and can also be referred to as "Hinglish".
   ICU 71 and CLDR 41 are minor releases, mostly focused on bug fixes and 
   small enhancements. (The fall CLDR/ICU releases will update to Unicode 15, which 
   is planned for September.) We are also working to re-establish continuous 
   performance testing for ICU, and on development towards future versions.
   ICU 71 updates to the time zone data version 2022a. Note that pre-1970 
   data for a number of time zones has been removed, as has been the case in the 
   upstream tzdata release (https://www.iana.org/time-zones) since 2021b.
   For details, please see https://icu.unicode.org/download/71.
   
   
   ... (truncated)
   
   
   Commits
   
   See full diff in compare view: https://github.com/unicode-org/icu/commits
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=com.ibm.icu:icu4j&package-manager=maven&previous-version=62.2&new-version=72.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
   alter it yourself. You can also trigger a rebase manually by commenting 
   `@dependabot rebase`.

[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620633#comment-17620633
 ] 

Nick Burch commented on TIKA-3890:
--

DOCX files are compressed XML. Text compresses very well; already-compressed 
images, audio and video don't.

An 8MB Word document of pure text could fairly easily produce 10x that in 
text, while an 8MB Word document that's mostly images could produce just a few 
bytes of text.

DOCX-specific: you could open the file in POI (use a File to save memory) and 
check the size of the Word XML stream and the size of any attachments; that'd 
give you a vague idea. However, it won't give you a complete answer, as the Word 
XML could have loads of complex stuff in it that doesn't end up in the text 
output...

The easiest way to know the size of the output is just to parse it on a beefy 
machine, with suitable restarts / respawning in place, and see what you get!

> Identifying an efficient approach for getting page count prior to running an 
> extraction
> ---
>
> Key: TIKA-3890
> URL: https://issues.apache.org/jira/browse/TIKA-3890
> Project: Tika
>  Issue Type: Improvement
>  Components: app
>Affects Versions: 2.5.0
> Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika config w/ 2GB reserved memory, 5GB limit 
>Reporter: Ethan Wilansky
>Priority: Blocker
>
> Tika is doing a great job with text extraction, until we encounter an Office 
> document with an  unreasonably large number of pages with extractable text. 
> For example a Word document containing thousands of text pages. 
> Unfortunately, we don't have an efficient way to determine page count before 
> calling the /tika or /rmeta endpoints and either getting back an array 
> allocation error or setting  byteArrayMaxOverride to a large number to return 
> the text or metadata containing the page count. Returning a result other than 
> the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type: 
> application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
> [http://localhost:9998/rmeta/ignore]}}
> {quote}{{with the configuration:}}
> {{}}
> {{}}
> {{  }}
> {{    }}
> {{       class="org.apache.tika.parser.ocr.TesseractOCRParser"/>}}
> {{       class="org.apache.tika.parser.microsoft.OfficeParser"/>}}
> {{    }}
> {{    }}
> {{      }}
> {{        17500}}
> {{      }}
> {{    }}
> {{  }}
> {{  }}
> {{    }}
> {{      12}}
> {{      }}
> {{        -Xms2000m}}
> {{        -Xmx5000m}}
> {{      }}
> {{    }}
> {{  }}
> {{}}
> {quote}
> returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I 
> don't configure {{byteArrayMaxOverride}} I get this exception in just over a 
> second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length 
> for this record type is 100,000,000.}} which is the preferred result.
> The exception is the preferred result. With that in mind, can you answer 
> these questions?
> 1. Will other extractable file types that don't use the OfficeParser also 
> throw the same array allocation error for very large text extractions? 
> 2. Is there any way to correlate the array length returned to the number of 
> lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate lines or pages of extractable 
> content in a file before sending it for extraction? It doesn't appear that 
> /rmeta with the /ignore path param significantly improves efficiency over 
> calling the /tika endpoint or /rmeta w/out /ignore.
> If it's useful, I can share the 8MB docx file containing 14k pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Ethan Wilansky (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620630#comment-17620630
 ] 

Ethan Wilansky commented on TIKA-3890:
--

Aha, I'll have to give Apache POI a try. Thanks, Nick. It would be useful to get 
an extracted file size estimate. For example, the 8MB docx file generated a 
31MB text file. Is there a way in Tika to estimate extraction size beforehand?






[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620610#comment-17620610
 ] 

Nick Burch commented on TIKA-3890:
--

The only way to be sure of how many pages are in a Word document is to render 
it (to screen / PDF / printer).

Some Word files get lucky and have a sensible number in the metadata, set by 
Word the last time it opened the file and felt like populating statistics, but 
that's by no means always the case.

If you're fairly sure your documents have sensible metadata, you could always 
pre-process with Apache POI. If you provide a File object and only read the 
metadata streams, it's pretty memory efficient to query.






[jira] [Updated] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Ethan Wilansky (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wilansky updated TIKA-3890:
-
Description: 
Tika is doing a great job with text extraction, until we encounter an Office 
document with an  unreasonably large number of pages with extractable text. For 
example a Word document containing thousands of text pages. Unfortunately, we 
don't have an efficient way to determine page count before calling the /tika or 
/rmeta endpoints and either getting back an array allocation error or setting  
byteArrayMaxOverride to a large number to return the text or metadata 
containing the page count. Returning a result other than the array allocation 
error can take significant time.

For example, this call:
{{curl -T ./8mb.docx -H "Content-Type: 
application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
[http://localhost:9998/rmeta/ignore]}}
{quote}{{with the configuration:}}
{{}}
{{}}
{{  }}
{{    }}
{{      }}
{{      }}
{{    }}
{{    }}
{{      }}
{{        17500}}
{{      }}
{{    }}
{{  }}
{{  }}
{{    }}
{{      12}}
{{      }}
{{        -Xms2000m}}
{{        -Xmx5000m}}
{{      }}
{{    }}
{{  }}
{{}}
{quote}
returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.

Yes, I know this is a huge docx file and I don't want to process it. If I don't 
configure {{byteArrayMaxOverride}} I get this exception in just over a second:

{{Tried to allocate an array of length 172,983,026, but the maximum length for 
this record type is 100,000,000.}} which is the preferred result.

The exception is the preferred result. With that in mind, can you answer these 
questions?
1. Will other extractable file types that don't use the OfficeParser also throw 
the same array allocation error for very large text extractions? 
2. Is there any way to correlate the array length returned to the number of 
lines or pages in the associated file to parse?
3. Is there an efficient way to calculate lines or pages of extractable content 
in a file before sending it for extraction? It doesn't appear that /rmeta with 
the /ignore path param significantly improves efficiency over calling the /tika 
endpoint or /rmeta w/out /ignore.

If it's useful, I can share the 8MB docx file containing 14k pages.

  was:
Tika is doing a great job with text extraction, until we encounter an Office 
document with an  unreasonably large number of pages with extractable text. For 
example a Word document containing thousands of text pages. Unfortunately, we 
don't have an efficient way to determine page count before calling the /tika or 
/rmeta endpoints and either getting back a record size error or setting  
byteArrayMaxOverride to a large number to either return the text or metadata 
containing the page count. In both cases, this can take significant time to 
return a result.

For example, this call:
{{curl -T ./8mb.docx -H "Content-Type: 
application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
[http://localhost:9998/rmeta/ignore]}}
{quote}{{with the configuration:}}
{{}}
{{}}
{{  }}
{{    }}
{{      }}
{{      }}
{{    }}
{{    }}
{{      }}
{{        17500}}
{{      }}
{{    }}
{{  }}
{{  }}
{{    }}
{{      12}}
{{      }}
{{        -Xms2000m}}
{{        -Xmx5000m}}
{{      }}
{{    }}
{{  }}
{{}}
{quote}
returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.

Yes, I know this is a huge docx file and I don't want to process it. If I don't 
configure {{byteArrayMaxOverride}} I get this exception in just over a second:

{{Tried to allocate an array of length 172,983,026, but the maximum length for 
this record type is 100,000,000.}} which is the preferred result.

The exception is the preferred result. With that in mind, can you answer these 
questions?
1. Will other extractable file types that don't use the OfficeParser also throw 
the same array allocation error for very large text extractions? 
2. Is there any way to correlate the array length returned to the number of 
lines or pages in the associated file to parse?
3. Is there an efficient way to calculate lines or pages of extractable content 
in a file before sending it for extraction? It doesn't appear that /rmeta with 
the /ignore path param significantly improves efficiency over calling the /tika 
endpoint or /rmeta w/out /ignore.

If it's useful, I can share the 8MB docx file containing 14k pages.



[jira] [Created] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Ethan Wilansky (Jira)
Ethan Wilansky created TIKA-3890:


 Summary: Identifying an efficient approach for getting page count 
prior to running an extraction
 Key: TIKA-3890
 URL: https://issues.apache.org/jira/browse/TIKA-3890
 Project: Tika
  Issue Type: Improvement
  Components: app
Affects Versions: 2.5.0
 Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
Docker container with 5.5GB reserved memory, 6GB limit
Tika config w/ 2GB reserved memory, 5GB limit 
Reporter: Ethan Wilansky


Tika is doing a great job with text extraction, until we encounter an Office 
document with an  unreasonably large number of pages with extractable text. For 
example a Word document containing thousands of text pages. Unfortunately, we 
don't have an efficient way to determine page count before calling the /tika or 
/rmeta endpoints and either getting back a record size error or setting  
byteArrayMaxOverride to a large number to either return the text or metadata 
containing the page count. In both cases, this can take significant time to 
return a result.

For example, this call:
{{curl -T ./8mb.docx -H "Content-Type: 
application/vnd.openxmlformats-officedocument.wordprocessingml.document" 
[http://localhost:9998/rmeta/ignore]}}
{quote}{{with the configuration:}}
{{}}
{{}}
{{  }}
{{    }}
{{      }}
{{      }}
{{    }}
{{    }}
{{      }}
{{        17500}}
{{      }}
{{    }}
{{  }}
{{  }}
{{    }}
{{      12}}
{{      }}
{{        -Xms2000m}}
{{        -Xmx5000m}}
{{      }}
{{    }}
{{  }}
{{}}
{quote}
returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.

Yes, I know this is a huge docx file and I don't want to process it. If I don't 
configure {{byteArrayMaxOverride}} I get this exception in just over a second:

{{Tried to allocate an array of length 172,983,026, but the maximum length for 
this record type is 100,000,000.}} which is the preferred result.

The exception is the preferred result. With that in mind, can you answer these 
questions?
1. Will other extractable file types that don't use the OfficeParser also throw 
the same array allocation error for very large text extractions? 
2. Is there any way to correlate the array length returned to the number of 
lines or pages in the associated file to parse?
3. Is there an efficient way to calculate lines or pages of extractable content 
in a file before sending it for extraction? It doesn't appear that /rmeta with 
the /ignore path param significantly improves efficiency over calling the /tika 
endpoint or /rmeta w/out /ignore.

If it's useful, I can share the 8MB docx file containing 14k pages.





[jira] [Commented] (TIKA-3887) Store PDActions and triggers in file's metadata

2022-10-19 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620580#comment-17620580
 ] 

Hudson commented on TIKA-3887:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #856 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/856/])
TIKA-3887 -- Extract PDF actions and triggers into the file's metadata -- fix 
CHANGES.txt (tallison: 
[https://github.com/apache/tika/commit/291a74147c6999c28c1b34b32a7b925eb1104ee6])
* (edit) CHANGES.txt


> Store PDActions and triggers in file's metadata
> ---
>
> Key: TIKA-3887
> URL: https://issues.apache.org/jira/browse/TIKA-3887
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 2.5.1
>
>






[jira] [Commented] (TIKA-3889) Include counts of 3d objects in addition to current boolean has3D in PDFs

2022-10-19 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620578#comment-17620578
 ] 

Hudson commented on TIKA-3889:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #856 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/856/])
TIKA-3889 -- include counts of 3d objects (tallison: 
[https://github.com/apache/tika/commit/f6264c7044148f98dd733b9194a92918bb36bea7])
* (edit) tika-core/src/main/java/org/apache/tika/metadata/PDF.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java


> Include counts of 3d objects in addition to current boolean has3D in PDFs
> -
>
> Key: TIKA-3889
> URL: https://issues.apache.org/jira/browse/TIKA-3889
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 2.5.1
>
>






[jira] [Commented] (TIKA-3886) Inject PDF annotation type into embedded files' metadata

2022-10-19 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620579#comment-17620579
 ] 

Hudson commented on TIKA-3886:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #856 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/856/])
TIKA-3886 -- extract annotationtype for embedded files in PDFs (tallison: 
[https://github.com/apache/tika/commit/fd474b6541cb397e9a1db4965b1725b1d9b5e241])
* (edit) tika-core/src/main/java/org/apache/tika/metadata/PDF.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
TIKA-3886 -- Extract PDF actions and triggers into the file's metadata 
(tallison: 
[https://github.com/apache/tika/commit/5062690cb18be20a6bde5b5e5e55755586c79ee2])
* (edit) tika-core/src/main/java/org/apache/tika/metadata/PDF.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
* (edit) CHANGES.txt


> Inject PDF annotation type into embedded files' metadata
> 
>
> Key: TIKA-3886
> URL: https://issues.apache.org/jira/browse/TIKA-3886
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Major
>
> In PDFs, embedded files may appear in annotations with different types, e.g. 
> 3D.  It would be helpful to associate the annotation types with the embedded 
> files by adding a metadata item to the embedded file.





[jira] [Commented] (TIKA-3888) to for checkstyle configs

2022-10-19 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620577#comment-17620577
 ] 

Hudson commented on TIKA-3888:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #856 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/856/])
TIKA-3888 (tallison: 
[https://github.com/apache/tika/commit/54e2e77e3f5672cd77372716e78a2f237d1e3043])
* (edit) tika-parsers/pom.xml
* (edit) tika-example/pom.xml
* (edit) tika-serialization/pom.xml
* (edit) tika-server/pom.xml
* (edit) tika-langdetect/pom.xml
* (edit) tika-batch/pom.xml
* (edit) tika-pipes/pom.xml
* (edit) tika-eval/pom.xml
* (edit) tika-fuzzing/pom.xml
* (edit) tika-core/pom.xml


>  to  for checkstyle configs
> 
>
> Key: TIKA-3888
> URL: https://issues.apache.org/jira/browse/TIKA-3888
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 2.5.1
>
>






[jira] [Resolved] (TIKA-3889) Include counts of 3d objects in addition to current boolean has3D in PDFs

2022-10-19 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3889.
---
Fix Version/s: 2.5.1
   Resolution: Fixed

> Include counts of 3d objects in addition to current boolean has3D in PDFs
> -
>
> Key: TIKA-3889
> URL: https://issues.apache.org/jira/browse/TIKA-3889
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 2.5.1
>
>






[jira] [Created] (TIKA-3889) Include counts of 3d objects in addition to current boolean has3D in PDFs

2022-10-19 Thread Tim Allison (Jira)
Tim Allison created TIKA-3889:
-

 Summary: Include counts of 3d objects in addition to current 
boolean has3D in PDFs
 Key: TIKA-3889
 URL: https://issues.apache.org/jira/browse/TIKA-3889
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison








[jira] [Resolved] (TIKA-3888) to for checkstyle configs

2022-10-19 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3888.
---
Fix Version/s: 2.5.1
   Resolution: Fixed

>  to  for checkstyle configs
> 
>
> Key: TIKA-3888
> URL: https://issues.apache.org/jira/browse/TIKA-3888
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 2.5.1
>
>






[jira] [Resolved] (TIKA-3887) Store PDActions and triggers in file's metadata

2022-10-19 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3887.
---
Fix Version/s: 2.5.1
   Resolution: Fixed

> Store PDActions and triggers in file's metadata
> ---
>
> Key: TIKA-3887
> URL: https://issues.apache.org/jira/browse/TIKA-3887
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 2.5.1
>
>






[jira] [Created] (TIKA-3888) to for checkstyle configs

2022-10-19 Thread Tim Allison (Jira)
Tim Allison created TIKA-3888:
-

 Summary:  to  for checkstyle configs
 Key: TIKA-3888
 URL: https://issues.apache.org/jira/browse/TIKA-3888
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison








[jira] [Created] (TIKA-3887) Store PDActions and triggers in file's metadata

2022-10-19 Thread Tim Allison (Jira)
Tim Allison created TIKA-3887:
-

 Summary: Store PDActions and triggers in file's metadata
 Key: TIKA-3887
 URL: https://issues.apache.org/jira/browse/TIKA-3887
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison








[jira] [Commented] (TIKA-3885) Move AsyncProcessor's main to a new module

2022-10-19 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620449#comment-17620449
 ] 

Hudson commented on TIKA-3885:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #854 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/854/])
TIKA-3885: Add a tika-async-cli module (tallison: 
[https://github.com/apache/tika/commit/2b9ba8612b20d2779863f08b908e89cc001b483f])
* (edit) tika-bom/pom.xml
* (edit) tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* (add) 
tika-pipes/tika-async-cli/src/main/java/org/apache/tika/async/cli/TikaAsyncCLI.java
* (edit) CHANGES.txt
* (edit) tika-app/pom.xml
* (add) tika-pipes/tika-async-cli/pom.xml
* (edit) pom.xml
* (edit) tika-core/src/main/java/org/apache/tika/pipes/async/AsyncProcessor.java
* (edit) tika-pipes/pom.xml


> Move AsyncProcessor's main to a new module
> --
>
> Key: TIKA-3885
> URL: https://issues.apache.org/jira/browse/TIKA-3885
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.5.1
>
>
> On TIKA-3876, we added a main() to AsyncProcessor. It would be helpful to 
> move this functionality out of AsyncProcessor/tika-core into its own module.  
> That'll allow us to package logging etc and tika-core in a new module.





[jira] [Created] (TIKA-3886) Inject PDF annotation type into embedded files' metadata

2022-10-19 Thread Tim Allison (Jira)
Tim Allison created TIKA-3886:
-

 Summary: Inject PDF annotation type into embedded files' metadata
 Key: TIKA-3886
 URL: https://issues.apache.org/jira/browse/TIKA-3886
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison


In PDFs, embedded files may appear in annotations with different types, e.g. 
3D.  It would be helpful to associate the annotation types with the embedded 
files by adding a metadata item to the embedded file.





[jira] [Resolved] (TIKA-3885) Move AsyncProcessor's main to a new module

2022-10-19 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3885.
---
Fix Version/s: 2.5.1
   Resolution: Fixed

> Move AsyncProcessor's main to a new module
> --
>
> Key: TIKA-3885
> URL: https://issues.apache.org/jira/browse/TIKA-3885
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.5.1
>
>
> On TIKA-3876, we added a main() to AsyncProcessor. It would be helpful to 
> move this functionality out of AsyncProcessor/tika-core into its own module.  
> That'll allow us to package logging etc and tika-core in a new module.





[GitHub] [tika] THausherr merged pull request #752: Bump protobuf-java from 3.21.7 to 3.21.8

2022-10-19 Thread GitBox


THausherr merged PR #752:
URL: https://github.com/apache/tika/pull/752


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [tika] THausherr merged pull request #753: Bump aws.version from 1.12.322 to 1.12.323

2022-10-19 Thread GitBox


THausherr merged PR #753:
URL: https://github.com/apache/tika/pull/753

