[jira] [Resolved] (TIKA-3885) Move AsyncProcessor's main to a new module
[ https://issues.apache.org/jira/browse/TIKA-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3885. --- Fix Version/s: 2.5.1 Resolution: Fixed > Move AsyncProcessor's main to a new module > -- > > Key: TIKA-3885 > URL: https://issues.apache.org/jira/browse/TIKA-3885 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Major > Fix For: 2.5.1 > > > On TIKA-3876, we added a main() to AsyncProcessor. It would be helpful to > move this functionality out of AsyncProcessor/tika-core into its own module. > That'll allow us to package logging etc and tika-core in a new module. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3886) Inject PDF annotation type into embedded files' metadata
Tim Allison created TIKA-3886: - Summary: Inject PDF annotation type into embedded files' metadata Key: TIKA-3886 URL: https://issues.apache.org/jira/browse/TIKA-3886 Project: Tika Issue Type: Improvement Reporter: Tim Allison In PDFs, embedded files may appear in annotations with different types, e.g. 3D. It would be helpful to associate the annotation types with the embedded files by adding a metadata item to the embedded file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3885) Move AsyncProcessor's main to a new module
[ https://issues.apache.org/jira/browse/TIKA-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620449#comment-17620449 ] Hudson commented on TIKA-3885: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #854 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/854/]) TIKA-3885: Add a tika-async-cli module (tallison: [https://github.com/apache/tika/commit/2b9ba8612b20d2779863f08b908e89cc001b483f]) * (edit) tika-bom/pom.xml * (edit) tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java * (add) tika-pipes/tika-async-cli/src/main/java/org/apache/tika/async/cli/TikaAsyncCLI.java * (edit) CHANGES.txt * (edit) tika-app/pom.xml * (add) tika-pipes/tika-async-cli/pom.xml * (edit) pom.xml * (edit) tika-core/src/main/java/org/apache/tika/pipes/async/AsyncProcessor.java * (edit) tika-pipes/pom.xml > Move AsyncProcessor's main to a new module > -- > > Key: TIKA-3885 > URL: https://issues.apache.org/jira/browse/TIKA-3885 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Major > Fix For: 2.5.1 > > > On TIKA-3876, we added a main() to AsyncProcessor. It would be helpful to > move this functionality out of AsyncProcessor/tika-core into its own module. > That'll allow us to package logging etc and tika-core in a new module. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3887) Store PDActions and triggers in file's metadata
Tim Allison created TIKA-3887: - Summary: Store PDActions and triggers in file's metadata Key: TIKA-3887 URL: https://issues.apache.org/jira/browse/TIKA-3887 Project: Tika Issue Type: Improvement Reporter: Tim Allison -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3888) to for checkstyle configs
Tim Allison created TIKA-3888: - Summary: to for checkstyle configs Key: TIKA-3888 URL: https://issues.apache.org/jira/browse/TIKA-3888 Project: Tika Issue Type: Task Reporter: Tim Allison -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3887) Store PDActions and triggers in file's metadata
[ https://issues.apache.org/jira/browse/TIKA-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3887. --- Fix Version/s: 2.5.1 Resolution: Fixed > Store PDActions and triggers in file's metadata > --- > > Key: TIKA-3887 > URL: https://issues.apache.org/jira/browse/TIKA-3887 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 2.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3888) to for checkstyle configs
[ https://issues.apache.org/jira/browse/TIKA-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3888. --- Fix Version/s: 2.5.1 Resolution: Fixed > to for checkstyle configs > > > Key: TIKA-3888 > URL: https://issues.apache.org/jira/browse/TIKA-3888 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 2.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3889) Include counts of 3d objects in addition to current boolean has3D in PDFs
Tim Allison created TIKA-3889: - Summary: Include counts of 3d objects in addition to current boolean has3D in PDFs Key: TIKA-3889 URL: https://issues.apache.org/jira/browse/TIKA-3889 Project: Tika Issue Type: Task Reporter: Tim Allison -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3889) Include counts of 3d objects in addition to current boolean has3D in PDFs
[ https://issues.apache.org/jira/browse/TIKA-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3889. --- Fix Version/s: 2.5.1 Resolution: Fixed > Include counts of 3d objects in addition to current boolean has3D in PDFs > - > > Key: TIKA-3889 > URL: https://issues.apache.org/jira/browse/TIKA-3889 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 2.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3888) to for checkstyle configs
[ https://issues.apache.org/jira/browse/TIKA-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620577#comment-17620577 ] Hudson commented on TIKA-3888: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #856 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/856/]) TIKA-3888 (tallison: [https://github.com/apache/tika/commit/54e2e77e3f5672cd77372716e78a2f237d1e3043]) * (edit) tika-parsers/pom.xml * (edit) tika-example/pom.xml * (edit) tika-serialization/pom.xml * (edit) tika-server/pom.xml * (edit) tika-langdetect/pom.xml * (edit) tika-batch/pom.xml * (edit) tika-pipes/pom.xml * (edit) tika-eval/pom.xml * (edit) tika-fuzzing/pom.xml * (edit) tika-core/pom.xml > to for checkstyle configs > > > Key: TIKA-3888 > URL: https://issues.apache.org/jira/browse/TIKA-3888 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 2.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3889) Include counts of 3d objects in addition to current boolean has3D in PDFs
[ https://issues.apache.org/jira/browse/TIKA-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620578#comment-17620578 ] Hudson commented on TIKA-3889: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #856 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/856/]) TIKA-3889 -- include counts of 3d objects (tallison: [https://github.com/apache/tika/commit/f6264c7044148f98dd733b9194a92918bb36bea7]) * (edit) tika-core/src/main/java/org/apache/tika/metadata/PDF.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java > Include counts of 3d objects in addition to current boolean has3D in PDFs > - > > Key: TIKA-3889 > URL: https://issues.apache.org/jira/browse/TIKA-3889 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 2.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3886) Inject PDF annotation type into embedded files' metadata
[ https://issues.apache.org/jira/browse/TIKA-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620579#comment-17620579 ] Hudson commented on TIKA-3886: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #856 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/856/]) TIKA-3886 -- extract annotationtype for embedded files in PDFs (tallison: [https://github.com/apache/tika/commit/fd474b6541cb397e9a1db4965b1725b1d9b5e241]) * (edit) tika-core/src/main/java/org/apache/tika/metadata/PDF.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java TIKA-3886 -- Extract PDF actions and triggers into the file's metadata (tallison: [https://github.com/apache/tika/commit/5062690cb18be20a6bde5b5e5e55755586c79ee2]) * (edit) tika-core/src/main/java/org/apache/tika/metadata/PDF.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java * (edit) CHANGES.txt > Inject PDF annotation type into embedded files' metadata > > > Key: TIKA-3886 > URL: https://issues.apache.org/jira/browse/TIKA-3886 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Major > > In PDFs, embedded files may appear in annotations with different types, e.g. > 3D. It would be helpful to associate the annotation types with the embedded > files by adding a metadata item to the embedded file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3887) Store PDActions and triggers in file's metadata
[ https://issues.apache.org/jira/browse/TIKA-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620580#comment-17620580 ] Hudson commented on TIKA-3887: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #856 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/856/]) TIKA-3887 -- Extract PDF actions and triggers into the file's metadata -- fix CHANGES.txt (tallison: [https://github.com/apache/tika/commit/291a74147c6999c28c1b34b32a7b925eb1104ee6]) * (edit) CHANGES.txt > Store PDActions and triggers in file's metadata > --- > > Key: TIKA-3887 > URL: https://issues.apache.org/jira/browse/TIKA-3887 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 2.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction
Ethan Wilansky created TIKA-3890: Summary: Identifying an efficient approach for getting page count prior to running an extraction Key: TIKA-3890 URL: https://issues.apache.org/jira/browse/TIKA-3890 Project: Tika Issue Type: Improvement Components: app Affects Versions: 2.5.0 Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores Docker container with 5.5GB reserved memory, 6GB limit Tika config w/ 2GB reserved memory, 5GB limit Reporter: Ethan Wilansky Tika is doing a great job with text extraction, until we encounter an Office document with an unreasonably large number of pages of extractable text, for example a Word document containing thousands of text pages. Unfortunately, we don't have an efficient way to determine page count before calling the /tika or /rmeta endpoints: we either get back a record-size error, or we set byteArrayMaxOverride to a large number to return the text or metadata containing the page count. In both cases, it can take significant time to return a result. For example, this call: {{curl -T ./8mb.docx -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" http://localhost:9998/rmeta/ignore}} {quote}with the configuration (the tika-config XML markup was stripped by the mail archive; the surviving values include the TesseractOCRParser and OfficeParser parser entries, the settings 17500 and 12, and the JVM args -Xms2000m and -Xmx5000m){quote} returns {{"xmpTPg:NPages":"14625"}} in ~53 seconds. Yes, I know this is a huge docx file and I don't want to process it. If I don't configure {{byteArrayMaxOverride}}, I get this exception in just over a second: {{Tried to allocate an array of length 172,983,026, but the maximum length for this record type is 100,000,000.}} The exception is the preferred result. With that in mind, can you answer these questions? 1. Will other extractable file types that don't use the OfficeParser also throw the same array-allocation error for very large text extractions? 2. Is there any way to correlate the array length returned to the number of lines or pages in the associated file to parse? 3. Is there an efficient way to calculate lines or pages of extractable content in a file before sending it for extraction? It doesn't appear that /rmeta with the /ignore path param significantly improves efficiency over calling the /tika endpoint or /rmeta without /ignore. If it's useful, I can share the 8MB docx file containing 14k pages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
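On question 3, one cheap pre-check is worth noting: a .docx is an OOXML ZIP container, and Word usually records document statistics, including a page count, in the docProps/app.xml entry, which can be read without parsing the document at all. The sketch below (a self-contained illustration using Python's standard zipfile module, not a Tika API) shows the idea; the count is whatever Word wrote on its last save, so it may be missing or stale.

```python
import re
import zipfile

def docx_page_count(docx):
    """Return Word's <Pages> statistic from docProps/app.xml, or None.

    Only the small app-properties ZIP entry is decompressed, so this is
    cheap even for multi-thousand-page files. The value is recorded by
    Word when it last saved the file, so it may be absent or stale.
    """
    with zipfile.ZipFile(docx) as z:
        try:
            xml = z.read("docProps/app.xml").decode("utf-8")
        except KeyError:  # entry not present in this container
            return None
    m = re.search(r"<Pages>(\d+)</Pages>", xml)
    return int(m.group(1)) if m else None
```

If the statistic is absent or untrustworthy, the only reliable count still requires rendering the document, as noted in the comments on this issue.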
[jira] [Updated] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Wilansky updated TIKA-3890: - Description: Tika is doing a great job with text extraction, until we encounter an Office document with an unreasonably large number of pages of extractable text, for example a Word document containing thousands of text pages. Unfortunately, we don't have an efficient way to determine page count before calling the /tika or /rmeta endpoints: we either get back an array allocation error, or we set byteArrayMaxOverride to a large number to return the text or metadata containing the page count. Returning a result other than the array allocation error can take significant time. For example, this call: {{curl -T ./8mb.docx -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" http://localhost:9998/rmeta/ignore}} {quote}with the configuration (the tika-config XML markup was stripped by the mail archive; the surviving values include the TesseractOCRParser and OfficeParser parser entries, the settings 17500 and 12, and the JVM args -Xms2000m and -Xmx5000m){quote} returns {{"xmpTPg:NPages":"14625"}} in ~53 seconds. Yes, I know this is a huge docx file and I don't want to process it. If I don't configure {{byteArrayMaxOverride}}, I get this exception in just over a second: {{Tried to allocate an array of length 172,983,026, but the maximum length for this record type is 100,000,000.}} The exception is the preferred result. With that in mind, can you answer these questions? 1. Will other extractable file types that don't use the OfficeParser also throw the same array allocation error for very large text extractions? 2. Is there any way to correlate the array length returned to the number of lines or pages in the associated file to parse? 3. Is there an efficient way to calculate lines or pages of extractable content in a file before sending it for extraction? It doesn't appear that /rmeta with the /ignore path param significantly improves efficiency over calling the /tika endpoint or /rmeta without /ignore. If it's useful, I can share the 8MB docx file containing 14k pages. > Identifying an efficient approach for getting page count prior to running an > extraction > --- > > Key: TIKA-3890 > URL: https://issues.apache.org/jira/browse/TIKA-3890 > Project: Tika > Issue Type: Improvement > Components: app > Affects Versions: 2.5.0 > Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores > Docker container with 5.5GB reserved memory, 6GB limit > Tika config w/ 2GB reserved memory, 5GB limit > Reporter: Ethan Wilansky > Priority: Blocker -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620610#comment-17620610 ] Nick Burch commented on TIKA-3890: -- The only way to be sure of how many pages are in a Word document is to render it (to screen / PDF / printer). Some Word files get lucky and have a sensible number in the metadata, set by Word from when it last opened the file and felt like populating statistics, but that's by no means always the case. If you're fairly sure your documents have sensible metadata, you could always pre-process with Apache POI. If you provide a File object and only read the metadata streams, it's pretty memory-efficient to query. > Identifying an efficient approach for getting page count prior to running an > extraction > --- > > Key: TIKA-3890 > URL: https://issues.apache.org/jira/browse/TIKA-3890 > Project: Tika > Issue Type: Improvement > Components: app > Affects Versions: 2.5.0 > Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores > Docker container with 5.5GB reserved memory, 6GB limit > Tika config w/ 2GB reserved memory, 5GB limit > Reporter: Ethan Wilansky > Priority: Blocker -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620630#comment-17620630 ] Ethan Wilansky commented on TIKA-3890: -- Aha, I'll have to give Apache POI a try. Thanks, Nick. It would also be useful to get an estimate of the extracted file's size. For example, the 8MB docx file generated a 31MB text file. Is there a way in Tika to estimate extraction size beforehand? > Identifying an efficient approach for getting page count prior to running an > extraction > --- > > Key: TIKA-3890 > URL: https://issues.apache.org/jira/browse/TIKA-3890 > Project: Tika > Issue Type: Improvement > Components: app > Affects Versions: 2.5.0 > Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores > Docker container with 5.5GB reserved memory, 6GB limit > Tika config w/ 2GB reserved memory, 5GB limit > Reporter: Ethan Wilansky > Priority: Blocker -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620633#comment-17620633 ] Nick Burch commented on TIKA-3890: -- DOCX files are compressed XML. Text compresses very well; already-compressed images, audio, and video don't. An 8MB Word document of pure text could fairly easily produce 10x that in text, while an 8MB Word document that's mostly images could produce just a few bytes of text. DOCX-specific: you could open the file in POI (use a File to save memory) and check the size of the word XML stream and the size of any attachments; that'd give you a vague idea. However, it won't give you a complete answer, as the word XML could have loads of complex stuff in it that doesn't end up in the text output... Easiest way to know the size of the output is just to parse it on a beefy machine with suitable restarts / respawning in place, and see what you get! > Identifying an efficient approach for getting page count prior to running an > extraction > --- > > Key: TIKA-3890 > URL: https://issues.apache.org/jira/browse/TIKA-3890 > Project: Tika > Issue Type: Improvement > Components: app > Affects Versions: 2.5.0 > Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores > Docker container with 5.5GB reserved memory, 6GB limit > Tika config w/ 2GB reserved memory, 5GB limit > Reporter: Ethan Wilansky > Priority: Blocker -- This message was sent by Atlassian Jira (v8.20.10#820010)
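Nick's DOCX-specific size check can be approximated without POI, since a .docx is a ZIP whose central directory already records each entry's uncompressed size. The sketch below (a self-contained illustration using Python's standard zipfile module, not Tika's or POI's API) compares the main word/document.xml part against embedded media; as cautioned in the comment above, the XML size is only a vague ceiling on text output, because markup that produces no text is counted too.

```python
import zipfile

def docx_size_profile(docx):
    """Rough profile of a .docx without parsing it: uncompressed bytes of
    the main document XML vs. embedded media. A text-heavy file shows a
    large XML part and little media; an image-heavy file, the reverse."""
    doc_xml = 0
    media = 0
    with zipfile.ZipFile(docx) as z:
        for info in z.infolist():
            if info.filename == "word/document.xml":
                doc_xml = info.file_size  # uncompressed size from the central directory
            elif info.filename.startswith("word/media/"):
                media += info.file_size
    return {"document_xml_bytes": doc_xml, "media_bytes": media}
```

Only the ZIP central directory is read here, so the cost is independent of document length; actually knowing the output size still requires parsing, as Nick says.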
[GitHub] [tika] dependabot[bot] opened a new pull request, #754: Bump icu4j from 62.2 to 72.1
dependabot[bot] opened a new pull request, #754: URL: https://github.com/apache/tika/pull/754 Bumps [icu4j](https://github.com/unicode-org/icu) from 62.2 to 72.1. Release notes, sourced from icu4j's releases (https://github.com/unicode-org/icu/releases): ICU 72.1 We are pleased to announce the release of Unicode® ICU 72. It updates to Unicode 15 (https://blog.unicode.org/2022/09/announcing-unicode-standard-version-150.html) and to CLDR 42 (https://cldr.unicode.org/index/downloads/cldr-42) locale data with various additions and corrections. ICU 72 and CLDR 42 are major releases, including a new version of Unicode and major locale data improvements. ICU 72 adds two technology preview implementations based on draft Unicode specifications: formatting of people's names in multiple languages (see https://cldr.unicode.org/index/downloads/cldr-42#h.nrv6xq99qe7d for CLDR background on why this feature is being added and what it does), and an enhanced version of message formatting. This release also updates to the time zone data version 2022e (2022-oct). Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release (https://www.iana.org/time-zones) since 2021b. For details, please see https://icu.unicode.org/download/72. Note: The prebuilt WinARM64 binaries below should be considered alpha/experimental. ICU 72rc with CLDR beta3 / tzdata2022d: https://icu.unicode.org/download/72 ICU 72 RC We are pleased to announce the release candidate for Unicode® ICU 72. It updates to Unicode 15 and to CLDR 42 locale data with various additions and corrections. ICU 72 adds technology preview implementations for person name formatting, as well as for a new version of message formatting based on a proposed draft Unicode specification.
ICU 72 and CLDR 42 are major releases, including a new version of Unicode and major locale data improvements. ICU 72 updates to the time zone data version 2022b (2022-aug), which is effectively the same as 2022c. Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release (https://www.iana.org/time-zones) since 2021b. For details, please see https://icu.unicode.org/download/72. Please test this release candidate on your platforms and report bugs and regressions by Tuesday, 2022-oct-18, via the icu-support mailing list (https://icu.unicode.org/contacts), and/or please find/submit error reports (https://icu.unicode.org/bugs). Please do not use this release candidate in production. The preliminary API reference documents are published on https://unicode-org.github.io/icu-docs/; follow the “Dev” links there. ICU 71.1 We are pleased to announce the release of Unicode® ICU 71. ICU 71 updates to CLDR 41 (https://cldr.unicode.org/index/downloads/cldr-41) locale data with various additions and corrections. ICU 71 adds phrase-based line breaking for Japanese. Existing line breaking methods follow standards and conventions for body text but do not work well for short Japanese text, such as in titles and headings. This new feature is optimized for these use cases. ICU 71 adds support for Hindi written in Latin letters (hi_Latn). The CLDR data for this increasingly popular locale has been significantly revised and expanded. Note that, based on user expectations, hi_Latn incorporates a large amount of English and can also be referred to as “Hinglish”. ICU 71 and CLDR 41 are minor releases, mostly focused on bug fixes and small enhancements. (The fall CLDR/ICU releases will update to Unicode 15, which is planned for September.) We are also working to re-establish continuous performance testing for ICU, and on development towards future versions.
ICU 71 updates to time zone data version 2022a. Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release (https://www.iana.org/time-zones) since 2021b. For details, please see https://icu.unicode.org/download/71.

... (truncated)

Commits

See full diff in compare view: https://github.com/unicode-org/icu/commits

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=com.ibm.icu:icu4j&package-manager=maven&previous-version=62.2&new-version=72.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself.
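For context, a bump like this amounts to a one-line version update in the Maven build. A minimal sketch of the dependency declaration, using the coordinates reported in the compatibility-score URL above (`com.ibm.icu:icu4j`); where exactly this is declared in Tika's poms may differ:

```xml
<!-- icu4j coordinates as reported by dependabot; location in Tika's poms may vary -->
<dependency>
  <groupId>com.ibm.icu</groupId>
  <artifactId>icu4j</artifactId>
  <version>72.1</version> <!-- previously 62.2 -->
</dependency>
```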
[GitHub] [tika] dependabot[bot] opened a new pull request, #755: Bump twelvemonkeys.version from 3.9.0 to 3.9.1
dependabot[bot] opened a new pull request, #755: URL: https://github.com/apache/tika/pull/755

Bumps `twelvemonkeys.version` from 3.9.0 to 3.9.1.

Updates `common-io` from 3.9.0 to 3.9.1
Updates `imageio-bmp` from 3.9.0 to 3.9.1
Updates `imageio-jpeg` from 3.9.0 to 3.9.1
Updates `imageio-psd` from 3.9.0 to 3.9.1
Updates `imageio-tiff` from 3.9.0 to 3.9.1

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [tika] dependabot[bot] opened a new pull request, #756: Bump aws.version from 1.12.323 to 1.12.324
dependabot[bot] opened a new pull request, #756: URL: https://github.com/apache/tika/pull/756

Bumps `aws.version` from 1.12.323 to 1.12.324.

Updates `aws-java-sdk-transcribe` from 1.12.323 to 1.12.324

Changelog sourced from aws-java-sdk-transcribe's changelog (https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md):

1.12.324 (2022-10-19)
- AWS CloudTrail: This release includes support for exporting CloudTrail Lake query results to an Amazon S3 bucket.
- AWS Config: This release adds resourceType enums for AppConfig, AppSync, DataSync, EC2, EKS, Glue, GuardDuty, SageMaker, ServiceDiscovery, SES, Route53 types.
- AWS S3 Control: Updates internal logic for constructing API endpoints. We have added rule-based endpoints and internal model parameters.
- AWS Support App: This release adds the RegisterSlackWorkspaceForOrganization API. You can use the API to register a Slack workspace for an AWS account that is part of an organization.
- Amazon Chime SDK Messaging: Documentation updates for Chime Messaging SDK.
- Amazon Connect Service: This release adds API support for managing phone numbers that can be used across multiple AWS regions through telephony traffic distribution.
- Amazon EventBridge: Updates internal logic for constructing API endpoints. We have added rule-based endpoints and internal model parameters.
- Amazon Managed Blockchain: Adding new Accessor APIs for Amazon Managed Blockchain.
- Amazon Simple Storage Service: Updates internal logic for constructing API endpoints. We have added rule-based endpoints and internal model parameters.
- Amazon WorkSpaces Web: WorkSpaces Web now supports user access logging for recording session start, stop, and URL navigation.
Commits
- cbeaf1a AWS SDK for Java 1.12.324 (https://github.com/aws/aws-sdk-java/commit/cbeaf1a74c54961982396782c05590862e5fef77)
- f587274 Update GitHub version number to 1.12.324-SNAPSHOT (https://github.com/aws/aws-sdk-java/commit/f5872749a14b8637612e3722beb07a4d8eb83084)

See full diff in compare view: https://github.com/aws/aws-sdk-java/compare/1.12.323...1.12.324

Updates `aws-java-sdk-s3` from 1.12.323 to 1.12.324. Its 1.12.324 changelog entry and commit list are identical to those quoted above for aws-java-sdk-transcribe.

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself.
[GitHub] [tika] THausherr merged pull request #756: Bump aws.version from 1.12.323 to 1.12.324
THausherr merged PR #756: URL: https://github.com/apache/tika/pull/756
[GitHub] [tika] THausherr merged pull request #755: Bump twelvemonkeys.version from 3.9.0 to 3.9.1
THausherr merged PR #755: URL: https://github.com/apache/tika/pull/755
[GitHub] [tika] THausherr closed pull request #754: Bump icu4j from 62.2 to 72.1
THausherr closed pull request #754: Bump icu4j from 62.2 to 72.1 URL: https://github.com/apache/tika/pull/754
[GitHub] [tika] dependabot[bot] commented on pull request #754: Bump icu4j from 62.2 to 72.1
dependabot[bot] commented on PR #754: URL: https://github.com/apache/tika/pull/754#issuecomment-1285000675

OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. You can also ignore all major, minor, or patch releases for a dependency by adding an [`ignore` condition](https://docs.github.com/en/code-security/supply-chain-security/configuration-options-for-dependency-updates#ignore) with the desired `update_types` to your config file. If you change your mind, just re-open this PR and I'll resolve any conflicts on it.