Re: [PR] Bump aws.version from 1.12.686 to 1.12.687 [tika]

2024-03-25 Thread via GitHub


THausherr merged PR #1692:
URL: https://github.com/apache/tika/pull/1692


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] Bump aws.version from 1.12.686 to 1.12.687 [tika]

2024-03-25 Thread via GitHub


dependabot[bot] opened a new pull request, #1692:
URL: https://github.com/apache/tika/pull/1692

   Bumps `aws.version` from 1.12.686 to 1.12.687.
   Updates `com.amazonaws:aws-java-sdk-s3` from 1.12.686 to 1.12.687
   
   Changelog
   Sourced from com.amazonaws:aws-java-sdk-s3's changelog
   (https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).
   
   1.12.687 2024-03-25
   
   AWS CodeBuild
   Features
   - Supporting GitLab and GitLab Self Managed as source types in AWS CodeBuild.
   
   AWS Elemental MediaLive
   Features
   - Exposing TileMedia H265 options
   
   AWS Global Accelerator
   Features
   - AWS Global Accelerator now supports cross-account sharing for bring your 
     own IP addresses.
   
   Amazon EC2 Container Service
   Features
   - Documentation only update for Amazon ECS.
   
   Amazon EMR Containers
   Features
   - This release increases the number of supported job template parameters 
     from 20 to 100.
   
   Amazon Elastic Compute Cloud
   Features
   - Added support for ModifyInstanceMetadataDefaults and 
     GetInstanceMetadataDefaults to set Instance Metadata Service account 
     defaults
   
   Amazon SageMaker Service
   Features
   - Introduced support for the following new instance types on SageMaker 
     Studio for JupyterLab and CodeEditor applications: m6i, m6id, m7i, c6i, 
     c6id, c7i, r6i, r6id, r7i, and p5
   
   Commits
   - 6b5d23e AWS SDK for Java 1.12.687
     (https://github.com/aws/aws-sdk-java/commit/6b5d23e9ca756c65f5e119ec7dbaacc6eff1327c)
   - cdc9f1c Update GitHub version number to 1.12.687-SNAPSHOT
     (https://github.com/aws/aws-sdk-java/commit/cdc9f1ceb3e71314e9e05153961a394b118c93ac)
   - See full diff in compare view
     (https://github.com/aws/aws-sdk-java/compare/1.12.686...1.12.687)
   
   
   
   
   Updates `com.amazonaws:aws-java-sdk-transcribe` from 1.12.686 to 1.12.687
   
   Changelog
   Sourced from com.amazonaws:aws-java-sdk-transcribe's changelog
   (https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).
   
   
   
   
   
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show  ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close 

[jira] [Commented] (TIKA-4223) STL file exported with OpenSCAD not detected correctly

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830723#comment-17830723
 ] 

Tim Allison commented on TIKA-4223:
---

Maybe? This suggests one of the ms pki cert family? 
https://help.sap.com/docs/CX_NG_SALES/ea5ff8b9460a43cb8765a3c07d3421fe/7b2aeb2b2a9446259246e0ff15a823c4.html

> STL file exported with OpenSCAD not detected correctly
> --
>
> Key: TIKA-4223
> URL: https://issues.apache.org/jira/browse/TIKA-4223
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 2.9.1
> Reporter: Robin Schimpf
> Priority: Major
> Attachments: linear_extrude_ascii.stl, linear_extrude_binary.stl
>
>
> STL files can be in ASCII or in binary format. When this file 
> (https://github.com/openscad/openscad/blob/master/examples/Basics/linear_extrude.scad)
> is exported with OpenSCAD to STL, the ASCII result file is detected as text/plain.
> Also, the binary STL is detected as application/vnd.ms-pki.stl, which differs 
> from the model/stl MIME type that Wikipedia lists for those files.
>  
> Used commands for attached files
> {code:java}
> openscad.exe --export-format asciistl -o result\linear_extrude_ascii.stl 
> examples\Basics\linear_extrude.scad {code}
> {code:java}
> openscad.exe --export-format binstl -o result\linear_extrude_binary.stl 
> examples\Basics\linear_extrude.scad
> {code}
> Refs:
> https://en.wikipedia.org/wiki/STL_(file_format)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4222) Add detection for OpenSCAD

2024-03-25 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830720#comment-17830720
 ] 

Hudson commented on TIKA-4222:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1573 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1573/])
TIKA-4222 -- add openscad glob (#1690) (github: 
[https://github.com/apache/tika/commit/c5693624cbd43d0d76357b9f21705991d6f3a4ff])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> Add detection for OpenSCAD
> --
>
> Key: TIKA-4222
> URL: https://issues.apache.org/jira/browse/TIKA-4222
> Project: Tika
> Issue Type: Improvement
> Reporter: Robin Schimpf
> Priority: Major
>
> OpenSCAD (https://openscad.org/index.html) is a 3D modeller based on a custom 
> script language. The files are currently detected as text/plain.
>  
>  
> Examples can be found here: 
> https://github.com/openscad/openscad/tree/master/examples





[jira] [Commented] (TIKA-4224) Add detection for 3MF

2024-03-25 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830721#comment-17830721
 ] 

Hudson commented on TIKA-4224:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1573 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1573/])
TIKA-4224 -- add detection for 3mf (#1689) (github: 
[https://github.com/apache/tika/commit/3ffbc04f7a1023aa8e6d5ea22d19feb2a7e61a8f])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/test3mf.3mf
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/ooxml/OPCPackageDetector.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
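
For readers wondering why a 3MF file shows up as application/zip: 3MF is a ZIP-based OPC package, so the container has to be opened to tell it apart. The following is a minimal Python sketch of that idea, not the logic of the OPCPackageDetector change. It assumes the package's root relationships file (`_rels/.rels`) references the 3D model relationship type from the 3MF core specification; `looks_like_3mf` is a hypothetical helper name.

```python
import io
import zipfile

# Relationship type used by the 3MF core specification to point at the
# 3D model part (assumption: taken from the spec, not from Tika's source).
MODEL_REL_TYPE = "http://schemas.microsoft.com/3dmanufacturing/2013/01/3dmodel"


def looks_like_3mf(data: bytes) -> bool:
    """Return True if data is a ZIP whose root relationships mention a 3D model."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            if "_rels/.rels" not in zf.namelist():
                return False
            rels = zf.read("_rels/.rels").decode("utf-8", errors="replace")
    except zipfile.BadZipFile:
        # Not a ZIP container at all.
        return False
    return MODEL_REL_TYPE in rels
```

A plain ZIP without that relationship is reported as not 3MF, which mirrors the application/zip fallback described in the issue.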


> Add detection for 3MF
> -
>
> Key: TIKA-4224
> URL: https://issues.apache.org/jira/browse/TIKA-4224
> Project: Tika
> Issue Type: Improvement
> Reporter: Robin Schimpf
> Priority: Major
> Attachments: linear_extrude.3mf
>
>
> 3MF is an alternative format to STL for 3D models. When this file 
> (https://github.com/openscad/openscad/blob/master/examples/Basics/linear_extrude.scad)
> is exported with OpenSCAD to 3MF, the result file is detected as application/zip.
>  
> Export command
> {code:java}
> openscad.exe -o result\linear_extrude.3mf examples\Basics\linear_extrude.scad 
> {code}
> Refs:
> [https://en.wikipedia.org/wiki/3D_Manufacturing_Format]
> [https://3mf.io/]





[jira] [Commented] (TIKA-4223) STL file exported with OpenSCAD not detected correctly

2024-03-25 Thread Robin Schimpf (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830707#comment-17830707
 ] 

Robin Schimpf commented on TIKA-4223:
-

application/vnd.ms-pki.stl might just be an alias (or older MIME type) for the 
binary STL format. I found this site 
(https://www.westaflex.com/support/dokumente/Dichtung) where the file is 
listed with that MIME type. After downloading and inspecting it, I can confirm 
it is in the binary STL format.



Re: [PR] TIKA-4222 -- add glob based detection for openscad [tika]

2024-03-25 Thread via GitHub


tballison merged PR #1690:
URL: https://github.com/apache/tika/pull/1690





[jira] [Commented] (TIKA-4224) Add detection for 3MF

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830701#comment-17830701
 ] 

ASF GitHub Bot commented on TIKA-4224:
--

tballison merged PR #1689:
URL: https://github.com/apache/tika/pull/1689






[jira] [Commented] (TIKA-4223) STL file exported with OpenSCAD not detected correctly

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830700#comment-17830700
 ] 

ASF GitHub Bot commented on TIKA-4223:
--

tballison opened a new pull request, #1691:
URL: https://github.com/apache/tika/pull/1691

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add new module that downstream users will depend upon add it to 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   






[jira] [Commented] (TIKA-4222) Add detection for OpenSCAD

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830702#comment-17830702
 ] 

ASF GitHub Bot commented on TIKA-4222:
--

tballison merged PR #1690:
URL: https://github.com/apache/tika/pull/1690






Re: [PR] TIKA-4224 -- create detection for 3mf [tika]

2024-03-25 Thread via GitHub


tballison merged PR #1689:
URL: https://github.com/apache/tika/pull/1689





[PR] TIKA-4223 -- add detection of stl [tika]

2024-03-25 Thread via GitHub


tballison opened a new pull request, #1691:
URL: https://github.com/apache/tika/pull/1691

   
   
   





[jira] [Commented] (TIKA-4225) Add detection for AMF

2024-03-25 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830698#comment-17830698
 ] 

Hudson commented on TIKA-4225:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1572 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1572/])
TIKA-4225 -- add detection for amf (#1688) (github: 
[https://github.com/apache/tika/commit/36e3ba8cd6f489be1241536661f6f1821458b902])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
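
Since AMF is XML (hence the application/xml result described in the issue), one way to tell it apart is to look at the root element. This is a minimal Python sketch of that idea, not the change made in tika-mimetypes.xml. It assumes the root element is named `amf`; `looks_like_amf` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def looks_like_amf(data: bytes) -> bool:
    """Return True if data parses as XML and its root element is <amf>."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Not well-formed XML, so it cannot be AMF.
        return False
    # Drop a namespace prefix of the form "{uri}name" if one is present.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name.lower() == "amf"
```

Any other well-formed XML document is reported as not AMF, which corresponds to the generic application/xml result.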


> Add detection for AMF
> -
>
> Key: TIKA-4225
> URL: https://issues.apache.org/jira/browse/TIKA-4225
> Project: Tika
> Issue Type: Improvement
> Reporter: Robin Schimpf
> Priority: Major
> Attachments: linear_extrude.amf
>
>
> AMF is an alternative format to STL for 3D models. When this file 
> (https://github.com/openscad/openscad/blob/master/examples/Basics/linear_extrude.scad)
> is exported with OpenSCAD to AMF, the result file is detected as application/xml.
>  
> Export command
> {code:java}
> openscad.exe -o result\linear_extrude.amf examples\Basics\linear_extrude.scad 
> {code}
> Refs:
> [https://en.wikipedia.org/wiki/Additive_manufacturing_file_format]





[jira] [Commented] (TIKA-4219) Figure out what to do with epubs with encrypted non-core content

2024-03-25 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830678#comment-17830678
 ] 

Hudson commented on TIKA-4219:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1571 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1571/])
TIKA-4219 -- improve epub handling of encrypted non-text-containing items 
(#1684) (tallison: 
[https://github.com/apache/tika/commit/a559906db468c14f6d7c3dae8b657ddaab4a1733])
* (delete) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/main/java/org/apache/tika/parser/epub/EncryptionParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/main/java/org/apache/tika/parser/epub/EpubParser.java


> Figure out what to do with epubs with encrypted non-core content
> 
>
> Key: TIKA-4219
> URL: https://issues.apache.org/jira/browse/TIKA-4219
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> On TIKA-4218, we noticed several epubs that were now being identified as 
> encrypted, which is good. We did this work on TIKA-4176.
> On the other hand, we found several epubs that were now identified as 
> encrypted but which had content before we were doing the encryption detection.
> The issue in at least one file that I reviewed is that non-core content is 
> encrypted -- the fonts. So, from a text+metadata extraction, we could still 
> get all the content and then throw an Encrypted Exception or maybe flag 
> something as encrypted.
> I'm not sure what the best thing to do is in this case.
> An example file is here: 
> http://corpora.tika.apache.org/base/docs/commoncrawl3/47/47WOSBEUHE6CRMVDFBOOHUD36FEQAZ6T
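
The "flag something as encrypted" option described above can be illustrated with a small Python sketch. It is not the EpubParser change. It assumes the list of encrypted resource paths has already been read from the epub, and both the font extension list and the `summarize_encryption` helper are hypothetical.

```python
# File extensions treated as fonts, i.e. non-core content (an assumption).
FONT_EXTENSIONS = (".otf", ".ttf", ".woff", ".woff2")


def summarize_encryption(encrypted_paths):
    """Report whether anything is encrypted and whether text can still be extracted."""
    non_fonts = [
        path for path in encrypted_paths
        if not path.lower().endswith(FONT_EXTENSIONS)
    ]
    return {
        # True if any resource at all is encrypted.
        "encrypted": bool(encrypted_paths),
        # Text is treated as extractable when only fonts (or nothing) are encrypted.
        "text_extractable": not non_fonts,
    }
```

With this shape, a caller could extract text and metadata first and then decide whether to raise an exception or only record the flag.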





[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-25 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830677#comment-17830677
 ] 

Hudson commented on TIKA-4171:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1571 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1571/])
TIKA-4171 -- fix regression when field names are missing in the XFAExtractor 
(#1679) (tallison: 
[https://github.com/apache/tika/commit/b9ab4813ed16f53a0bf3aa61883da2cebdf7f3a1])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java


> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
> Issue Type: Bug
> Components: tika-server
> Reporter: Cassandra Xia
> Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6!
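
The reporter's hunch (values stored under a single key, so only the last one survives) can be illustrated with a short Python sketch. It is not Tika's code. Both helper names are hypothetical, and the input is assumed to be a sequence of (field name, value) pairs in document order.

```python
from collections import defaultdict


def collect_last_value(pairs):
    """Single-valued mapping: a later value silently overwrites an earlier one."""
    result = {}
    for key, value in pairs:
        result[key] = value
    return result


def collect_all_values(pairs):
    """Multi-valued mapping: every value for a repeated key is kept, in order."""
    result = defaultdict(list)
    for key, value in pairs:
        result[key].append(value)
    return dict(result)
```

With seven rows sharing one key, the first helper keeps only row 7, while the second returns all seven values as a list.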





[jira] [Commented] (TIKA-4223) STL file exported with OpenSCAD not detected correctly

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830675#comment-17830675
 ] 

Tim Allison commented on TIKA-4223:
---

Even worse, there are two other file formats that can use *.stl. And Tika does 
not allow for more than one file type per glob. :( :(

application/x-ebu-stl (this at least has magic)
application/vnd.ms-pki.stl (we don't currently have magic for this one...don't 
know if it exists).



[jira] [Commented] (TIKA-4223) STL file exported with OpenSCAD not detected correctly

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830671#comment-17830671
 ] 

Tim Allison commented on TIKA-4223:
---

Yikes, yes, no magic for the binary. :( 
https://www.loc.gov/preservation/digital/formats/fdd/fdd000505.shtml
http://formats.kaitai.io/stl/index.html
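
To make the "no magic" point concrete: binary STL has no fixed signature, only an 80-byte free-form header, a little-endian uint32 triangle count, and 50 bytes per triangle, so a structural size check is one of the few options. This is a minimal Python sketch under that assumption and is not what PR #1691 implements; `classify_stl` is a hypothetical helper name.

```python
import struct


def classify_stl(data: bytes) -> str:
    """Classify data as 'binary' STL, 'ascii' STL, or 'unknown'."""
    # Check the binary layout first, because a binary header may itself
    # begin with the word "solid".
    if len(data) >= 84:
        (triangle_count,) = struct.unpack_from("<I", data, 80)
        if len(data) == 84 + 50 * triangle_count:
            return "binary"
    # ASCII STL starts with the keyword "solid".
    if data.lstrip()[:5].lower() == b"solid":
        return "ascii"
    return "unknown"
```

The size check can produce false positives on arbitrary data whose length happens to fit, which is one reason a rule like this is weaker than real magic bytes.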



[jira] [Commented] (TIKA-4222) Add detection for OpenSCAD

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830669#comment-17830669
 ] 

ASF GitHub Bot commented on TIKA-4222:
--

tballison opened a new pull request, #1690:
URL: https://github.com/apache/tika/pull/1690

   
   
   






[PR] TIKA-4222 -- add glob based detection for openscad [tika]

2024-03-25 Thread via GitHub


tballison opened a new pull request, #1690:
URL: https://github.com/apache/tika/pull/1690

   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-4222) Add detection for OpenSCAD

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830668#comment-17830668
 ] 

Tim Allison commented on TIKA-4222:
---

Will use: application/x-openscad ? Based on: 
https://github.com/openscad/openscad/issues/647
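
For readers unfamiliar with glob-based detection, here is a minimal sketch of the idea in plain Java. It is not Tika's implementation (in Tika, globs are declared in `tika-mimetypes.xml`); the class name `GlobDetectionSketch`, its `detect` method, and the `application/octet-stream` fallback are assumptions made for illustration. The only value taken from this thread is the proposed type `application/x-openscad`.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: maps "*.ext" globs to media types by file name only.
final class GlobDetectionSketch {
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();

    static {
        // Media type proposed in this thread for OpenSCAD scripts.
        GLOBS.put("*.scad", "application/x-openscad");
    }

    // Returns the media type of the first matching glob, or a generic fallback.
    static String detect(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            String suffix = e.getKey().substring(1); // "*.scad" becomes ".scad"
            if (lower.endsWith(suffix)) {
                return e.getValue();
            }
        }
        return "application/octet-stream"; // assumed fallback for this sketch
    }

    public static void main(String[] args) {
        System.out.println(detect("linear_extrude.scad"));
        System.out.println(detect("notes.txt"));
    }
}
```

A name-only match like this cannot tell a real OpenSCAD script from any other text file that happens to end in `.scad`, which is the usual trade-off of glob-based detection.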



[jira] [Commented] (TIKA-4224) Add detection for 3MF

2024-03-25 Thread Robin Schimpf (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830664#comment-17830664
 ] 

Robin Schimpf commented on TIKA-4224:
-

Ah ok. Skipped the OPC part. Mime Type is fine for me.

> Add detection for 3MF
> ---------------------
>
> Key: TIKA-4224
> URL: https://issues.apache.org/jira/browse/TIKA-4224
> Project: Tika
> Issue Type: Improvement
> Reporter: Robin Schimpf
> Priority: Major
> Attachments: linear_extrude.3mf
>
>
> 3MF is an alternative format to STL for 3D models. Exporting this file 
> ([https://github.com/openscad/openscad/blob/master/examples/Basics/linear_extrude.scad)]
>  with OpenSCAD into 3MF the result file is detected as application/zip.
>  
> Export command
> {code:java}
> openscad.exe -o result\linear_extrude.3mf examples\Basics\linear_extrude.scad 
> {code}
> Refs:
> [https://en.wikipedia.org/wiki/3D_Manufacturing_Format]
> [https://3mf.io/]





[jira] [Commented] (TIKA-4224) Add detection for 3MF

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830661#comment-17830661
 ] 

ASF GitHub Bot commented on TIKA-4224:
--

tballison commented on PR #1689:
URL: https://github.com/apache/tika/pull/1689#issuecomment-2018700354

   Converted to draft until there's agreement on the mime type.






[jira] [Commented] (TIKA-4225) Add detection for AMF

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830663#comment-17830663
 ] 

ASF GitHub Bot commented on TIKA-4225:
--

tballison merged PR #1688:
URL: https://github.com/apache/tika/pull/1688




> Add detection for AMF
> ---------------------
>
> Key: TIKA-4225
> URL: https://issues.apache.org/jira/browse/TIKA-4225
> Project: Tika
> Issue Type: Improvement
> Reporter: Robin Schimpf
> Priority: Major
> Attachments: linear_extrude.amf
>
>
> AMF is an alternative format to STL for 3D models. Exporting this file 
> ([https://github.com/openscad/openscad/blob/master/examples/Basics/linear_extrude.scad)]
>  with OpenSCAD into AMF the result file is detected as application/xml.
>  
> Export command
> {code:java}
> openscad.exe -o result\linear_extrude.amf examples\Basics\linear_extrude.scad 
> {code}
> Refs:
> [https://en.wikipedia.org/wiki/Additive_manufacturing_file_format]





Re: [PR] TIKA-4225 -- add detection for amf [tika]

2024-03-25 Thread via GitHub


tballison merged PR #1688:
URL: https://github.com/apache/tika/pull/1688





Re: [PR] TIKA-4224 -- create detection for 3mf [tika]

2024-03-25 Thread via GitHub


tballison commented on PR #1689:
URL: https://github.com/apache/tika/pull/1689#issuecomment-2018700354

   Converted to draft until there's agreement on the mime type.





[jira] [Commented] (TIKA-4224) Add detection for 3MF

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830660#comment-17830660
 ] 

ASF GitHub Bot commented on TIKA-4224:
--

tballison opened a new pull request, #1689:
URL: https://github.com/apache/tika/pull/1689

   
   


[PR] TIKA-4224 -- create detection for 3mf [tika]

2024-03-25 Thread via GitHub


tballison opened a new pull request, #1689:
URL: https://github.com/apache/tika/pull/1689

   
   



[jira] [Commented] (TIKA-4224) Add detection for 3MF

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830655#comment-17830655
 ] 

Tim Allison commented on TIKA-4224:
---

The file correctly loads as an OPCPackage, in accordance with the spec. There's 
one relationship: http://schemas.microsoft.com/3dmanufacturing/2013/01/3dmodel 
which is also listed as "required" in the spec. It doesn't say it explicitly, 
but it does say that the model part is required, and the way you specify the 
model part is as a relationship with that URL.
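
As a rough illustration of the check described above, the sketch below looks inside a zip container for the OPC root relationships part (`_rels/.rels`) and tests whether it mentions the 3D model relationship type. It is a sketch under stated assumptions, not the detection added in PR #1689: the class `ThreeMfSketch`, its method names, and the plain substring match (instead of real XML parsing or POI's OPCPackage) are all hypothetical, and the demo package it builds is not a valid 3MF file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: decide whether a zip looks like 3MF via its root relationships.
final class ThreeMfSketch {
    // Relationship type quoted in the comment above.
    static final String MODEL_REL_TYPE =
            "http://schemas.microsoft.com/3dmanufacturing/2013/01/3dmodel";
    // Root relationships part of an OPC package.
    static final String ROOT_RELS = "_rels/.rels";

    static boolean looksLike3mf(byte[] zipBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (ROOT_RELS.equals(entry.getName())) {
                    ByteArrayOutputStream buf = new ByteArrayOutputStream();
                    byte[] chunk = new byte[4096];
                    int n;
                    while ((n = zip.read(chunk)) != -1) {
                        buf.write(chunk, 0, n);
                    }
                    String rels = new String(buf.toByteArray(), StandardCharsets.UTF_8);
                    // Simplification: substring match rather than parsing the XML.
                    return rels.contains(MODEL_REL_TYPE);
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a tiny in-memory package holding only a relationships part (demo data).
    static byte[] demoPackage(String relType) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(ROOT_RELS));
            String rels = "<Relationships><Relationship Id=\"rel0\" Type=\"" + relType
                    + "\" Target=\"/3D/3dmodel.model\"/></Relationships>";
            zip.write(rels.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(looksLike3mf(demoPackage(MODEL_REL_TYPE)));
    }
}
```

Keying on the relationship type rather than on a part name means the check does not depend on where the model part is stored inside the package.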



[jira] [Commented] (TIKA-4224) Add detection for 3MF

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830656#comment-17830656
 ] 

Tim Allison commented on TIKA-4224:
---

Any objections to using 
"application/vnd.ms-package.3dmanufacturing-3dmodel+xml" as the mime type?



[jira] [Commented] (TIKA-4225) Add detection for AMF

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830648#comment-17830648
 ] 

ASF GitHub Bot commented on TIKA-4225:
--

tballison commented on code in PR #1688:
URL: https://github.com/apache/tika/pull/1688#discussion_r1538062689


##
tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml:
##
@@ -3286,6 +3286,12 @@
 
 
   
+  
+
https://en.wikipedia.org/wiki/Additive_manufacturing_file_format
+
+

Review Comment:
   Too many tabs open. Fixed. Thank you.







[jira] [Commented] (TIKA-4225) Add detection for AMF

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830647#comment-17830647
 ] 

ASF GitHub Bot commented on TIKA-4225:
--

tballison commented on code in PR #1688:
URL: https://github.com/apache/tika/pull/1688#discussion_r1538062689


##
tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml:
##
@@ -3286,6 +3286,12 @@
 
 
   
+  
+
https://en.wikipedia.org/wiki/Additive_manufacturing_file_format
+
+

Review Comment:
   https://en.wikipedia.org/wiki/Additive_manufacturing_file_format Fixed. 
Thank you.







Re: [PR] TIKA-4225 -- add detection for amf [tika]

2024-03-25 Thread via GitHub


tballison commented on code in PR #1688:
URL: https://github.com/apache/tika/pull/1688#discussion_r1538062689


##
tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml:
##
@@ -3286,6 +3286,12 @@
 
 
   
+  
+
https://en.wikipedia.org/wiki/Additive_manufacturing_file_format
+
+

Review Comment:
   Too many tabs open. Fixed. Thank you.






Re: [PR] TIKA-4225 -- add detection for amf [tika]

2024-03-25 Thread via GitHub


tballison commented on code in PR #1688:
URL: https://github.com/apache/tika/pull/1688#discussion_r1538062689


##
tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml:
##
@@ -3286,6 +3286,12 @@
 
 
   
+  
+
https://en.wikipedia.org/wiki/Additive_manufacturing_file_format
+
+

Review Comment:
   https://en.wikipedia.org/wiki/Additive_manufacturing_file_format Fixed. 
Thank you.






[jira] (TIKA-4225) Add detection for AMF

2024-03-25 Thread Tim Allison (Jira)


[ https://issues.apache.org/jira/browse/TIKA-4225 ]


Tim Allison deleted comment on TIKA-4225:
---

was (Author: talli...@mitre.org):
:facepalm: too many windows open. Thank you!



[jira] [Commented] (TIKA-4225) Add detection for AMF

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830643#comment-17830643
 ] 

Tim Allison commented on TIKA-4225:
---

:facepalm: too many windows open. Thank you!



[jira] [Commented] (TIKA-4225) Add detection for AMF

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830633#comment-17830633
 ] 

ASF GitHub Bot commented on TIKA-4225:
--

theobisproject commented on code in PR #1688:
URL: https://github.com/apache/tika/pull/1688#discussion_r1538041575


##
tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml:
##
@@ -3286,6 +3286,12 @@
 
 
   
+  
+
https://en.wikipedia.org/wiki/Additive_manufacturing_file_format
+
+

Review Comment:
   This is the file ending for the OpenSCAD format. File ending for amf is 
`*.amf`







Re: [PR] TIKA-4225 -- add detection for amf [tika]

2024-03-25 Thread via GitHub


theobisproject commented on code in PR #1688:
URL: https://github.com/apache/tika/pull/1688#discussion_r1538041575


##
tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml:
##
@@ -3286,6 +3286,12 @@
 
 
   
+  
+
https://en.wikipedia.org/wiki/Additive_manufacturing_file_format
+
+

Review Comment:
   This is the file ending for the OpenSCAD format. File ending for amf is 
`*.amf`






[jira] [Commented] (TIKA-4224) Add detection for 3MF

2024-03-25 Thread Robin Schimpf (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830632#comment-17830632
 ] 

Robin Schimpf commented on TIKA-4224:
-

Reading the spec at 
[https://github.com/3MFConsortium/spec_core/blob/master/3MF%20Core%20Specification.md],
 I see no mention of the [ContentTypes].xml file. Per the recommendation at 
[https://github.com/3MFConsortium/spec_core/blob/master/3MF%20Core%20Specification.md#22-part-naming-recommendations],
 the /3D/3dModel.model file, which is an XML file, should be checked.



[jira] [Commented] (TIKA-4225) Add detection for AMF

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830621#comment-17830621
 ] 

ASF GitHub Bot commented on TIKA-4225:
--

tballison opened a new pull request, #1688:
URL: https://github.com/apache/tika/pull/1688

   
   


[PR] TIKA-4225 -- add detection for amf [tika]

2024-03-25 Thread via GitHub


tballison opened a new pull request, #1688:
URL: https://github.com/apache/tika/pull/1688

   
   



[jira] [Commented] (TIKA-4220) Commons-compress too lenient on headless tar detection

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830614#comment-17830614
 ] 

ASF GitHub Bot commented on TIKA-4220:
--

tballison merged PR #1687:
URL: https://github.com/apache/tika/pull/1687




> Commons-compress too lenient on headless tar detection
> ------------------------------------------------------
>
> Key: TIKA-4220
> URL: https://issues.apache.org/jira/browse/TIKA-4220
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> On recent regression tests on TIKA-4218, we noticed a fairly major change 
> with an increased rate of false positives on headless tar detection from 
> commons-compress.
> I think for now we should copy/paste/fork the headless tar detection and 
> improve it/revert it or possibly remove it for our 2.9.2 release.
> On this ticket, I'll look into what changed recently in headless tar 
> detection in commons-compress and experiment with fixing it.
> One challenge is that our magic bytes detection happens _after_ our custom 
> detectors, which means that we can't put a low confidence on what comes out 
> of our custom detectors and let the magic detection fix it. We could  
> implement an x-tar special case, but I really don't like that.
> Let's see what we can do...
> The numbers below represent the number of files identified as A (in tika 
> 2.9.1) -> B (in tika-2.9.2-pre-rc1).
> application/octet-stream -> application/x-tar           826
> multipart/appledouble -> application/x-tar              701
> image/x-tga -> application/x-tar                        322
> image/vnd.microsoft.icon -> application/x-tar           312
> application/vnd.iccprofile -> application/x-tar         221
> video/mp4 -> application/x-tar                          177
> audio/mpeg -> application/x-tar                         59
> video/x-m4v -> application/x-tar                        59
> application/x-font-printer-metric -> application/x-tar  36
> audio/mp4 -> application/x-tar                          25
> application/x-tex-tfm -> application/x-tar              18
> image/x-pict -> application/x-tar                       15
> image/png -> application/x-tar                          8
> text/plain; charset=ISO-8859-1 -> application/x-tar     8
> application/x-endnote-style -> application/x-tar        7
> application/x-font-ttf -> application/x-tar             6
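
The thread does not show what the workaround in PR #1687 does, so the following is only a sketch of one stricter signal a detector could apply, assuming "headless" refers to tar data without the `ustar` magic. A tar header is a 512-byte block whose checksum field (8 bytes at offset 148) stores, in octal, the sum of all header bytes with the checksum field itself counted as spaces. The class `TarChecksumSketch` and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: accept a block as a tar header only if its checksum verifies.
final class TarChecksumSketch {
    static final int HEADER_LEN = 512;
    static final int CHKSUM_OFFSET = 148;
    static final int CHKSUM_LEN = 8;

    // Sum of all header bytes (unsigned), with the checksum field counted as spaces.
    static long computeChecksum(byte[] header) {
        long sum = 0;
        for (int i = 0; i < HEADER_LEN; i++) {
            boolean inField = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LEN;
            sum += inField ? ' ' : (header[i] & 0xFF);
        }
        return sum;
    }

    // Parses the stored octal checksum; returns -1 if the field holds no octal number.
    static long storedChecksum(byte[] header) {
        long value = 0;
        boolean sawDigit = false;
        for (int i = CHKSUM_OFFSET; i < CHKSUM_OFFSET + CHKSUM_LEN; i++) {
            byte b = header[i];
            if (b == 0 || b == ' ') {
                if (sawDigit) {
                    break;      // trailing NUL or space ends the number
                }
                continue;       // leading padding
            }
            if (b < '0' || b > '7') {
                return -1;
            }
            value = value * 8 + (b - '0');
            sawDigit = true;
        }
        return sawDigit ? value : -1;
    }

    static boolean plausibleTarHeader(byte[] block) {
        if (block.length < HEADER_LEN) {
            return false;
        }
        long stored = storedChecksum(block);
        return stored >= 0 && stored == computeChecksum(block);
    }

    // Builds a minimal header (name and checksum only) for demonstration.
    static byte[] demoHeader() {
        byte[] h = new byte[HEADER_LEN];
        byte[] name = "hello.txt".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(name, 0, h, 0, name.length);
        String octal = String.format("%06o", computeChecksum(h));
        byte[] field = (octal + "\0 ").getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(field, 0, h, CHKSUM_OFFSET, CHKSUM_LEN);
        return h;
    }

    public static void main(String[] args) {
        System.out.println(plausibleTarHeader(demoHeader()));
    }
}
```

A checksum match alone is a weak signal, since short binary files can satisfy it by chance; a real detector would likely combine it with other header sanity checks.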





Re: [PR] TIKA-4220 -- temporary workaround for tar detection regression [tika]

2024-03-25 Thread via GitHub


tballison merged PR #1687:
URL: https://github.com/apache/tika/pull/1687





[jira] [Updated] (TIKA-4221) Regression in pack200 parsing in commons-compress

2024-03-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4221:
--
Summary: Regression in pack200 parsing in commons-compress  (was: 
Regression in unpack200 parsing in commons-compress)
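
The issue quoted below hinges on a "close shield": a wrapper whose `close()` does not reach the stream it wraps, so a caller can hand the stream to a parser and keep reading afterwards. The sketch below shows that contract in plain Java. The classes `Shield` and `TrackingStream` are hypothetical stand-ins, not commons-io's CloseShieldInputStream, and the sketch does not reproduce the regression itself.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch of the close-shield contract.
final class CloseShieldSketch {
    // Records whether close() reached the underlying stream.
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Minimal shield: close() on the wrapper is swallowed, the delegate stays open.
    static final class Shield extends FilterInputStream {
        Shield(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally does not close the delegate.
        }
    }

    // Closes the shielded wrapper and reports whether the delegate was closed too.
    static boolean delegateClosedAfterShieldedClose() {
        TrackingStream inner = new TrackingStream(new byte[] {1, 2, 3});
        try (InputStream shielded = new Shield(inner)) {
            shielded.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return inner.closed;
    }

    public static void main(String[] args) {
        System.out.println(delegateClosedAfterShieldedClose());
    }
}
```

A shield only protects against `close()` calls made through the wrapper; code that obtains and closes the underlying stream by another route bypasses it.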

> Regression in pack200 parsing in commons-compress
> -------------------------------------------------
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> There's a regression in unpack200 that leads to the InputStream being closed 
> even if wrapped in a CloseShieldInputStream.
> This was the original signal that something was wrong, but the real problem 
> is in unpack200, not xz.
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).
> The stacktrace from 
> {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
>   looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception : 
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.DefaultParser@56a4479a
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>   at 
> org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
>   at 
> org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
>   at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
>   at 
> org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:412)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:389)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:49)
>   at 
> org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:389)
>   at 
> org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:329)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   ... 85 more



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4221) Regression in pack200 parsing in commons-compress

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830551#comment-17830551
 ] 

Tim Allison edited comment on TIKA-4221 at 3/25/24 5:09 PM:


This is caused by a modification of pack200's Archive class. In 
commons-compress 1.25.0, the InputStream was wrapped in a 
CloseShieldInputStream and then not closed. Starting in 1.26.0, there's code 
that unwraps FilterInputStreams to get down to the source stream. This 
defeats CloseShieldInputStream, and the underlying stream is 
closed.

See: 
https://github.com/apache/commons-compress/blob/68cd2e7fb488b4ad8a9fdc81cae97ae6e8248ea5/src/main/java/org/apache/commons/compress/harmony/unpack200/Pack200UnpackerAdapter.java#L66

This only causes problems when a pack200 file is embedded in another file that 
is read through an ArchiveInputStream, which is why it happens so rarely in our corpus.

That said, this is less than ideal.

We can probably work around this by writing our own CloseShieldInputStream that 
doesn't extend FilterInputStream. 
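
A minimal sketch of that workaround idea, for illustration only: a close shield that extends InputStream directly, so code that peels FilterInputStream layers finds nothing to unwrap. The class names (NonFilterCloseShield, TrackingStream, CloseShieldDemo) are hypothetical and are not taken from Tika or commons-compress; the workaround actually merged for TIKA-4221 may look different.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical close shield that is NOT a FilterInputStream. */
final class NonFilterCloseShield extends InputStream {
    private final InputStream delegate; // the stream that must stay open
    private boolean shieldClosed;

    NonFilterCloseShield(InputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public int read() throws IOException {
        return shieldClosed ? -1 : delegate.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return shieldClosed ? -1 : delegate.read(b, off, len);
    }

    @Override
    public int available() throws IOException {
        return shieldClosed ? 0 : delegate.available();
    }

    @Override
    public void close() {
        // Deliberately leave the delegate open; only this wrapper is closed.
        shieldClosed = true;
    }
}

/** Test double that records whether close() reached the source stream. */
final class TrackingStream extends ByteArrayInputStream {
    boolean closed;

    TrackingStream(byte[] data) {
        super(data);
    }

    @Override
    public void close() throws IOException {
        closed = true;
        super.close();
    }
}

final class CloseShieldDemo {
    /** True if closing the shield also closed the source (it should not). */
    static boolean sourceClosedByShield() {
        TrackingStream source = new TrackingStream(new byte[] {1, 2, 3});
        NonFilterCloseShield shield = new NonFilterCloseShield(source);
        shield.close();
        return source.closed;
    }

    /** True if an unwrapper that looks for FilterInputStream would match. */
    static boolean shieldIsFilterStream() {
        InputStream shield = new NonFilterCloseShield(new TrackingStream(new byte[0]));
        return shield instanceof FilterInputStream;
    }
}
```

Both demo methods are expected to return false: close() stops at the shield, and the shield is not an instance of FilterInputStream. The trade-off is that the wrapper has to forward every InputStream method it cares about by hand.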


was (Author: talli...@mitre.org):
This is caused by a modification of unpack200's Archive class. In 
commons-compress 1.25.0, the inputstream was wrapped as a 
CloseShieldInputStream and then not closed. Starting in 1.26.0, there's code 
that unwraps FIlterInputStreams to get down to the source stream. This means 
that this now defeats CloseShieldInputStream, and the underlying stream is 
closed.

See: 
https://github.com/apache/commons-compress/blob/68cd2e7fb488b4ad8a9fdc81cae97ae6e8248ea5/src/main/java/org/apache/commons/compress/harmony/unpack200/Pack200UnpackerAdapter.java#L66

This only causes problems when an unpack200 file is embedded in another file 
with an ArchiveInputStream, which is why it is happening so rarely in our 
corpus.

That said, this is less than ideal.

We can probably work around this by writing our own CloseShieldInputStream that 
doesn't extend FilterInputStream. 

> Regression in pack200 parsing in commons-compress
> -
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> There's a regression in pack200 that leads to the InputStream being closed 
> even if wrapped in a CloseShieldInputStream.
> This was the original signal that something was wrong, but the real problem 
> is in pack200, not xz.
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).
> The stacktrace from 
> {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
>   looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception : 
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.DefaultParser@56a4479a
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>   at 
> org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
>   at 
> org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
>   at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
>   at 
> org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at 

[jira] [Commented] (TIKA-4223) STL file exported with OpenSCAD not detected correctly

2024-03-25 Thread Robin Schimpf (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830607#comment-17830607
 ] 

Robin Schimpf commented on TIKA-4223:
-

If I understand the Wikipedia article correctly, the ASCII file has to start with 
"solid". The text afterwards is the model name, so that part is flexible.

The "OpenSCAD Model" in the binary file also seems to be the model name. 
Wikipedia mentions an 80-byte header, but there seem to be no magic bytes 
for detection. So maybe the only way would be the file extension?
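
One possible content-based heuristic, sketched here under assumptions that go beyond this thread: in the binary layout described in the Wikipedia article, the 80-byte header is followed by a little-endian 32-bit triangle count, and each triangle takes 50 bytes, so the file length should equal 84 + 50 * count. The class StlSniffer and its method are hypothetical names, and the check needs the whole file (or at least its length), which a detector that sees only a prefix may not have.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical STL sniffer; not part of Tika. */
final class StlSniffer {
    enum Kind { ASCII, BINARY, UNKNOWN }

    static Kind sniff(byte[] file) {
        // Binary STL: 80-byte header, uint32 triangle count, 50 bytes per triangle.
        // Checked first because a binary header may itself begin with "solid".
        if (file.length >= 84) {
            long triangles = ByteBuffer.wrap(file, 80, 4)
                    .order(ByteOrder.LITTLE_ENDIAN)
                    .getInt() & 0xFFFFFFFFL;
            if (84L + 50L * triangles == file.length) {
                return Kind.BINARY;
            }
        }
        // ASCII STL: the file starts with the keyword "solid".
        String prefix = new String(file, 0, Math.min(file.length, 5),
                StandardCharsets.US_ASCII);
        if (prefix.equals("solid")) {
            return Kind.ASCII;
        }
        return Kind.UNKNOWN;
    }
}
```

The size check is only a plausibility test, not a magic number, so falling back to the file extension would still be needed for truncated or padded files.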

> STL file exported with OpenSCAD not detected correctly
> --
>
> Key: TIKA-4223
> URL: https://issues.apache.org/jira/browse/TIKA-4223
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.9.1
>Reporter: Robin Schimpf
>Priority: Major
> Attachments: linear_extrude_ascii.stl, linear_extrude_binary.stl
>
>
> STL files can be in ASCII or in binary format. When this file 
> (https://github.com/openscad/openscad/blob/master/examples/Basics/linear_extrude.scad)
> is exported to STL with OpenSCAD, the ASCII result file is detected as text/plain.
> The binary STL is detected as application/vnd.ms-pki.stl, which differs 
> from the model/stl mime type Wikipedia lists for those files.
>  
> Used commands for attached files
> {code:java}
> openscad.exe --export-format asciistl -o result\linear_extrude_ascii.stl 
> examples\Basics\linear_extrude.scad {code}
> {code:java}
> openscad.exe --export-format binstl -o result\linear_extrude_binary.stl 
> examples\Basics\linear_extrude.scad
> {code}
> Refs:
> https://en.wikipedia.org/wiki/STL_(file_format)





[jira] [Updated] (TIKA-4221) Regression in pack200 parsing in commons-compress

2024-03-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4221:
--
Description: 
There's a regression in pack200 that leads to the InputStream being closed even 
if wrapped in a CloseShieldInputStream.

This was the original signal that something was wrong, but the real problem is 
in pack200, not xz.


We noticed ~10 xz files with fewer attachments in the recent regression tests 
in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, but 
not a blocker (IMHO).

The stacktrace from 
{{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
  looks like this:

3: X-TIKA:EXCEPTION:embedded_exception : 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.DefaultParser@56a4479a
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
at 
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
at 
org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
at 
org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
at 
org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
at 
org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
...
Caused by: org.tukaani.xz.XZIOException: Stream closed
at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
at 
org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
at java.io.FilterInputStream.available(FilterInputStream.java:168)
at 
org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
at java.io.FilterInputStream.available(FilterInputStream.java:168)
at 
org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
at java.io.FilterInputStream.available(FilterInputStream.java:168)
at 
org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
at 
org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800)
at 
org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:412)
at 
org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:389)
at 
org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:49)
at 
org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:389)
at 
org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:329)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
... 85 more

  was:
There's a regression in unpack200 that leads to the InputStream being closed 
even if wrapped in a CloseShieldInputStream.

This was the original signal that something was wrong, but the real problem is 
in unpack200, not xz.


We noticed ~10 xz files with fewer attachments in the recent regression tests 
in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, but 
not a blocker (IMHO).

The stacktrace from 
{{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
  looks like this:

3: X-TIKA:EXCEPTION:embedded_exception : 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.DefaultParser@56a4479a
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
at 
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
  

[jira] [Commented] (TIKA-4222) Add detection for OpenSCAD

2024-03-25 Thread Robin Schimpf (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830605#comment-17830605
 ] 

Robin Schimpf commented on TIKA-4222:
-

Yes, I think the only way to detect it is via the file extension.

> Add detection for OpenSCAD
> --
>
> Key: TIKA-4222
> URL: https://issues.apache.org/jira/browse/TIKA-4222
> Project: Tika
>  Issue Type: Improvement
>Reporter: Robin Schimpf
>Priority: Major
>
> OpenSCAD (https://openscad.org/index.html) is a 3D modeller based on a custom 
> script language. The files are currently detected as text/plain.
>  
>  
> Examples can be found here: 
> https://github.com/openscad/openscad/tree/master/examples





[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-25 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830604#comment-17830604
 ] 

Tilman Hausherr commented on TIKA-4218:
---

To be honest, I didn't look further because these problems affected too many 
files. Yes, please rerun the tests so that whatever remains stands out.

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>






[jira] [Commented] (TIKA-4220) Commons-compress too lenient on headless tar detection

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830598#comment-17830598
 ] 

ASF GitHub Bot commented on TIKA-4220:
--

tballison opened a new pull request, #1687:
URL: https://github.com/apache/tika/pull/1687

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add new module that downstream users will depend upon add it to 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Commons-compress too lenient on headless tar detection
> --
>
> Key: TIKA-4220
> URL: https://issues.apache.org/jira/browse/TIKA-4220
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> On recent regression tests on TIKA-4218, we noticed a fairly major change 
> with an increased rate of false positives on headless tar detection from 
> commons-compress.
> I think for now we should copy/paste/fork the headless tar detection and 
> improve it/revert it or possibly remove it for our 2.9.2 release.
> On this ticket, I'll look into what changed recently in headless tar 
> detection in commons-compress and experiment with fixing it.
> One challenge is that our magic bytes detection happens _after_ our custom 
> detectors, which means that we can't put a low confidence on what comes out 
> of our custom detectors and let the magic detection fix it. We could  
> implement an x-tar special case, but I really don't like that.
> Let's see what we can do...
> The numbers below represent the number of files identified as A (in tika 
> 2.9.1) -> B (in tika-2.9.2-pre-rc1).
> application/octet-stream -> application/x-tar           826
> multipart/appledouble -> application/x-tar              701
> image/x-tga -> application/x-tar                        322
> image/vnd.microsoft.icon -> application/x-tar           312
> application/vnd.iccprofile -> application/x-tar         221
> video/mp4 -> application/x-tar                          177
> audio/mpeg -> application/x-tar                          59
> video/x-m4v -> application/x-tar                         59
> application/x-font-printer-metric -> application/x-tar   36
> audio/mp4 -> application/x-tar                           25
> application/x-tex-tfm -> application/x-tar               18
> image/x-pict -> application/x-tar                        15
> image/png -> application/x-tar                            8
> text/plain; charset=ISO-8859-1 -> application/x-tar       8
> application/x-endnote-style -> application/x-tar          7
> application/x-font-ttf -> application/x-tar               6
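
For context on why headless tar detection is fragile: without the "ustar" magic, about the only structural signal in a tar header block is its checksum. The sketch below shows a strict checksum test on a 512-byte header block, assuming the standard tar layout (checksum field at offset 148, 8 bytes, octal digits, computed with the field itself counted as spaces). TarHeaderCheck is a hypothetical name; this is not the commons-compress code and not necessarily what the TIKA-4220 workaround does.

```java
/** Hypothetical strict tar header check; not the commons-compress logic. */
final class TarHeaderCheck {
    static final int HEADER_SIZE = 512;
    static final int CHKSUM_OFFSET = 148;
    static final int CHKSUM_LEN = 8;

    static boolean hasValidChecksum(byte[] header) {
        if (header.length < HEADER_SIZE) {
            return false;
        }
        long stored = parseOctal(header, CHKSUM_OFFSET, CHKSUM_LEN);
        if (stored < 0) {
            return false; // no octal digits at all, e.g. an all-zero block
        }
        long sum = 0;
        for (int i = 0; i < HEADER_SIZE; i++) {
            boolean inChecksum = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LEN;
            // The checksum field itself is summed as if it held eight spaces.
            sum += inChecksum ? 0x20 : (header[i] & 0xFF);
        }
        return sum == stored;
    }

    /** Parses leading-space-padded octal digits; returns -1 if there are none. */
    static long parseOctal(byte[] buf, int off, int len) {
        int i = off;
        int end = off + len;
        while (i < end && buf[i] == ' ') {
            i++;
        }
        long value = 0;
        int digits = 0;
        for (; i < end && buf[i] >= '0' && buf[i] <= '7'; i++) {
            value = value * 8 + (buf[i] - '0');
            digits++;
        }
        return digits == 0 ? -1 : value;
    }
}
```

A check like this rejects blocks that merely happen to parse as a header, at the cost of also rejecting archives written by tools that compute the checksum in a non-standard way.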





[PR] TIKA-4220 -- temporary workaround for tar detection regression [tika]

2024-03-25 Thread via GitHub


tballison opened a new pull request, #1687:
URL: https://github.com/apache/tika/pull/1687

   
   
   





[jira] [Commented] (TIKA-4221) Regression in unpack200 parsing in commons-compress

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830589#comment-17830589
 ] 

ASF GitHub Bot commented on TIKA-4221:
--

tballison merged PR #1686:
URL: https://github.com/apache/tika/pull/1686




> Regression in unpack200 parsing in commons-compress
> ---
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> There's a regression in unpack200 that leads to the InputStream being closed 
> even if wrapped in a CloseShieldInputStream.
> This was the original signal that something was wrong, but the real problem 
> is in unpack200, not xz.
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).
> The stacktrace from 
> {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
>   looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception : 
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.DefaultParser@56a4479a
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>   at 
> org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
>   at 
> org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
>   at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
>   at 
> org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:412)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:389)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:49)
>   at 
> org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:389)
>   at 
> org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:329)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   ... 85 more





Re: [PR] TIKA-4221 -- temporary workaround [tika]

2024-03-25 Thread via GitHub


tballison merged PR #1686:
URL: https://github.com/apache/tika/pull/1686





[jira] [Updated] (TIKA-4221) Regression in unpack200 parsing in commons-compress

2024-03-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4221:
--
Description: 
There's a regression in unpack200 that leads to the InputStream being closed 
even if wrapped in a CloseShieldInputStream.

This was the original signal that something was wrong, but the real problem is 
in unpack200, not xz.


We noticed ~10 xz files with fewer attachments in the recent regression tests 
in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, but 
not a blocker (IMHO).

The stacktrace from 
{{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
  looks like this:

3: X-TIKA:EXCEPTION:embedded_exception : 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.DefaultParser@56a4479a
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
at 
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
at 
org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
at 
org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
at 
org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
at 
org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
...
Caused by: org.tukaani.xz.XZIOException: Stream closed
at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
at 
org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
at java.io.FilterInputStream.available(FilterInputStream.java:168)
at 
org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
at java.io.FilterInputStream.available(FilterInputStream.java:168)
at 
org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
at java.io.FilterInputStream.available(FilterInputStream.java:168)
at 
org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
at 
org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800)
at 
org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:412)
at 
org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:389)
at 
org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:49)
at 
org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:389)
at 
org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:329)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
... 85 more

  was:
We noticed ~10 xz files with fewer attachments in the recent regression tests 
in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, but 
not a blocker (IMHO).

The stacktrace from 
{{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
  looks like this:

3: X-TIKA:EXCEPTION:embedded_exception : 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.DefaultParser@56a4479a
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
at 
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
at 
org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
at 

[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830585#comment-17830585
 ] 

Tim Allison commented on TIKA-4218:
---

[~tilman] did you see any other blockers/surprises? Once I merge TIKA-4221, 
I'll rerun the regression tests if there's not anything else to fix.

I see you've updated pdfbox already! :D

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4225) Add detection for AMF

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830580#comment-17830580
 ] 

Tim Allison commented on TIKA-4225:
---

It should be straightforward to add a lookup for an XML root element named "amf".
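
A minimal sketch of such a root-element check, assuming plain JDK StAX rather than Tika's own detection machinery. The class and method names (AmfRootSniffer, rootLocalName, looksLikeAmf) are hypothetical and are not part of Tika:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not a Tika class.
final class AmfRootSniffer {

    // Returns the local name of the first element in the stream,
    // or null if the input is not parseable XML up to that point.
    static String rootLocalName(InputStream in) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or external entities while sniffing.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return null;
        }
        return null;
    }

    // True when the root element's local name is "amf".
    static boolean looksLikeAmf(InputStream in) {
        return "amf".equals(rootLocalName(in));
    }

    public static void main(String[] args) {
        byte[] sample = "<?xml version=\"1.0\"?><amf unit=\"millimeter\"></amf>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeAmf(new ByteArrayInputStream(sample)));
    }
}
```

In Tika itself this would more likely be a declarative root-XML entry in the mime-type definitions than hand-written parsing; the code only illustrates the idea.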

> Add detection for AMF
> -
>
> Key: TIKA-4225
> URL: https://issues.apache.org/jira/browse/TIKA-4225
> Project: Tika
>  Issue Type: Improvement
>Reporter: Robin Schimpf
>Priority: Major
> Attachments: linear_extrude.amf
>
>
> AMF is an alternative format to STL for 3D models. When this file 
> ([https://github.com/openscad/openscad/blob/master/examples/Basics/linear_extrude.scad])
>  is exported to AMF with OpenSCAD, the resulting file is detected as application/xml.
>  
> Export command
> {code:java}
> openscad.exe -o result\linear_extrude.amf examples\Basics\linear_extrude.scad 
> {code}
> Refs:
> [https://en.wikipedia.org/wiki/Additive_manufacturing_file_format]





[jira] [Commented] (TIKA-4224) Add detection for 3MF

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830579#comment-17830579
 ] 

Tim Allison commented on TIKA-4224:
---

We can read [Content_Types].xml and look for a "model" entry: 
application/vnd.ms-package.3dmanufacturing-3dmodel+xml?
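
A rough sketch of that check, assuming only the JDK's java.util.zip classes. The class and method names (ThreeMfSniffer, looksLike3mf, zipWithContentTypes) are hypothetical, and a plain substring match stands in for real XML parsing of the content-types part:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not a Tika class.
final class ThreeMfSniffer {
    static final String CONTENT_TYPES_ENTRY = "[Content_Types].xml";
    static final String MODEL_TYPE =
            "application/vnd.ms-package.3dmanufacturing-3dmodel+xml";

    // Scans the zip for [Content_Types].xml and reports whether that
    // part mentions the 3MF model content type.
    static boolean looksLike3mf(InputStream in) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (CONTENT_TYPES_ENTRY.equals(entry.getName())) {
                    String xml = new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                    return xml.contains(MODEL_TYPE);
                }
            }
            return false;
        } catch (IOException e) {
            return false;
        }
    }

    // Builds a one-entry zip in memory; for demonstration only.
    static byte[] zipWithContentTypes(String xml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(CONTENT_TYPES_ENTRY));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String xml = "<Types><Default Extension=\"model\" ContentType=\""
                + MODEL_TYPE + "\"/></Types>";
        byte[] zip = zipWithContentTypes(xml);
        System.out.println(looksLike3mf(new ByteArrayInputStream(zip)));
    }
}
```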

> Add detection for 3MF
> -
>
> Key: TIKA-4224
> URL: https://issues.apache.org/jira/browse/TIKA-4224
> Project: Tika
>  Issue Type: Improvement
>Reporter: Robin Schimpf
>Priority: Major
> Attachments: linear_extrude.3mf
>
>
> 3MF is an alternative format to STL for 3D models. When this file 
> ([https://github.com/openscad/openscad/blob/master/examples/Basics/linear_extrude.scad])
>  is exported to 3MF with OpenSCAD, the resulting file is detected as application/zip.
>  
> Export command
> {code:java}
> openscad.exe -o result\linear_extrude.3mf examples\Basics\linear_extrude.scad 
> {code}
> Refs:
> [https://en.wikipedia.org/wiki/3D_Manufacturing_Format]
> [https://3mf.io/]





[jira] [Commented] (TIKA-4223) STL file exported with OpenSCAD not detected correctly

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830577#comment-17830577
 ] 

Tim Allison commented on TIKA-4223:
---

I'm guessing we can rely on the magic in these examples? "OpenSCAD Model" for 
the binary and "solid OpenSCAD_Model" for the text? Or is there some 
flexibility?
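
A small sketch of what matching on those two strings could look like. It assumes the magic sits at offset 0 in both variants, which is a guess based on the attached examples rather than anything the STL format guarantees; the class name and the returned labels are hypothetical:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Tika class.
final class OpenScadStlMagic {
    static final byte[] BINARY_MAGIC =
            "OpenSCAD Model".getBytes(StandardCharsets.US_ASCII);
    static final byte[] ASCII_MAGIC =
            "solid OpenSCAD_Model".getBytes(StandardCharsets.US_ASCII);

    // True when data begins with the given prefix bytes.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Classifies the leading bytes of a file as "ascii", "binary" or "unknown".
    static String classify(byte[] head) {
        if (startsWith(head, ASCII_MAGIC)) {
            return "ascii";
        }
        if (startsWith(head, BINARY_MAGIC)) {
            return "binary";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] head = "solid OpenSCAD_Model\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(classify(head));
    }
}
```

A match this strict would only catch STL files written by OpenSCAD, which is the flexibility concern raised above.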

> STL file exported with OpenSCAD not detected correctly
> --
>
> Key: TIKA-4223
> URL: https://issues.apache.org/jira/browse/TIKA-4223
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.1
>Reporter: Robin Schimpf
>Priority: Major
> Attachments: linear_extrude_ascii.stl, linear_extrude_binary.stl
>
>
> STL files can be in ASCII or binary format. When this file 
> ([https://github.com/openscad/openscad/blob/master/examples/Basics/linear_extrude.scad])
>  is exported to STL with OpenSCAD, the ASCII result is detected as text/plain.
> The binary STL is detected as application/vnd.ms-pki.stl, which differs 
> from the model/stl MIME type that Wikipedia lists for those files.
>  
> Used commands for attached files
> {code:java}
> openscad.exe --export-format asciistl -o result\linear_extrude_ascii.stl 
> examples\Basics\linear_extrude.scad {code}
> {code:java}
> openscad.exe --export-format binstl -o result\linear_extrude_binary.stl 
> examples\Basics\linear_extrude.scad
> {code}
> Refs:
> https://en.wikipedia.org/wiki/STL_(file_format)





[jira] [Updated] (TIKA-4223) STL file exported with OpenSCAD not detected correctly

2024-03-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4223:
--
Issue Type: Improvement  (was: Bug)

> STL file exported with OpenSCAD not detected correctly
> --
>
> Key: TIKA-4223
> URL: https://issues.apache.org/jira/browse/TIKA-4223
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.9.1
>Reporter: Robin Schimpf
>Priority: Major
> Attachments: linear_extrude_ascii.stl, linear_extrude_binary.stl
>
>
> STL files can be in ASCII or binary format. When this file 
> ([https://github.com/openscad/openscad/blob/master/examples/Basics/linear_extrude.scad])
>  is exported to STL with OpenSCAD, the ASCII result is detected as text/plain.
> The binary STL is detected as application/vnd.ms-pki.stl, which differs 
> from the model/stl MIME type that Wikipedia lists for those files.
>  
> Used commands for attached files
> {code:java}
> openscad.exe --export-format asciistl -o result\linear_extrude_ascii.stl 
> examples\Basics\linear_extrude.scad {code}
> {code:java}
> openscad.exe --export-format binstl -o result\linear_extrude_binary.stl 
> examples\Basics\linear_extrude.scad
> {code}
> Refs:
> https://en.wikipedia.org/wiki/STL_(file_format)





[jira] [Commented] (TIKA-4222) Add detection for OpenSCAD

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830575#comment-17830575
 ] 

Tim Allison commented on TIKA-4222:
---

Looks like no magic is available. We'll have to rely on file extension?
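
For illustration, an extension-only fallback might look like the sketch below. The media type name application/x-openscad is a placeholder invented for this example, not a registered or agreed type, and the class is hypothetical:

```java
import java.util.Locale;

// Hypothetical helper, not a Tika class.
final class ScadExtensionCheck {
    // Placeholder media type name, assumed for this sketch only.
    static final String SCAD_TYPE = "application/x-openscad";

    // Maps a file name to SCAD_TYPE when it ends in .scad (case-insensitive);
    // otherwise falls back to text/plain, which is what the files are detected as today.
    static String typeForName(String fileName) {
        if (fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".scad")) {
            return SCAD_TYPE;
        }
        return "text/plain";
    }

    public static void main(String[] args) {
        System.out.println(typeForName("linear_extrude.scad"));
    }
}
```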

> Add detection for OpenSCAD
> --
>
> Key: TIKA-4222
> URL: https://issues.apache.org/jira/browse/TIKA-4222
> Project: Tika
>  Issue Type: Improvement
>Reporter: Robin Schimpf
>Priority: Major
>
> OpenSCAD (https://openscad.org/index.html) is a 3D modeller based on a custom 
> script language. The files are currently detected as text/plain.
>  
>  
> Examples can be found here: 
> https://github.com/openscad/openscad/tree/master/examples





[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830572#comment-17830572
 ] 

ASF GitHub Bot commented on TIKA-4171:
--

tballison merged PR #1679:
URL: https://github.com/apache/tika/pull/1679




> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below, lines 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2=ee87dc4bd1=0.0.7=msg-f:1782641700487887488=18bd372e8760fa80=fimg=ip=s0-l75-ft=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0=emb=ii_lmdun7ff6!
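
The reporter's hunch can be illustrated with a generic sketch. This is not Tika's actual code path; the class and method names are hypothetical. It only shows how a single-valued map keeps the last value per key, while a list-valued map keeps every value:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not a Tika class.
final class MultiValueFields {

    // Last write wins: earlier values for a repeated key are overwritten.
    static Map<String, String> lastValueOnly(String[][] pairs) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            out.put(pair[0], pair[1]);
        }
        return out;
    }

    // Every value is kept, in input order, under its key.
    static Map<String, List<String>> allValues(String[][] pairs) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            out.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] rows = {{"field", "row 1"}, {"field", "row 2"}, {"field", "row 7"}};
        System.out.println(lastValueOnly(rows));
        System.out.println(allValues(rows));
    }
}
```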





Re: [PR] TIKA-4219 -- improve epub handling of encrypted non-text-containing items [tika]

2024-03-25 Thread via GitHub


tballison merged PR #1684:
URL: https://github.com/apache/tika/pull/1684


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-4219) Figure out what to do with epubs with encrypted non-core content

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830573#comment-17830573
 ] 

ASF GitHub Bot commented on TIKA-4219:
--

tballison merged PR #1684:
URL: https://github.com/apache/tika/pull/1684




> Figure out what to do with epubs with encrypted non-core content
> 
>
> Key: TIKA-4219
> URL: https://issues.apache.org/jira/browse/TIKA-4219
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> On TIKA-4218, we noticed several epubs that were now being identified as 
> encrypted, which is good. We did this work on TIKA-4176.
> On the other hand, we found several epubs that were now identified as 
> encrypted but which had content before we were doing the encryption detection.
> The issue in at least one file that I reviewed is that non-core content is 
> encrypted -- the fonts. So, from a text+metadata extraction standpoint, we 
> could still get all the content and then throw an Encrypted Exception or maybe 
> flag something as encrypted.
> I'm not sure what the best thing to do is in this case.
> An example file is here: 
> http://corpora.tika.apache.org/base/docs/commoncrawl3/47/47WOSBEUHE6CRMVDFBOOHUD36FEQAZ6T





Re: [PR] TIKA-4171 -- fix regression when field value is missing in XFA [tika]

2024-03-25 Thread via GitHub


tballison merged PR #1679:
URL: https://github.com/apache/tika/pull/1679





[jira] [Commented] (TIKA-4221) Regression in unpack200 parsing in commons-compress

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830570#comment-17830570
 ] 

ASF GitHub Bot commented on TIKA-4221:
--

tballison opened a new pull request, #1686:
URL: https://github.com/apache/tika/pull/1686

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Regression in unpack200 parsing in commons-compress
> ---
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).
> The stacktrace from 
> {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
>   looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception : 
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.DefaultParser@56a4479a
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>   at 
> org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
>   at 
> org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
>   at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
>   at 
> org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800)
> 

[jira] [Commented] (TIKA-4221) Regression in unpack200 parsing in commons-compress

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830569#comment-17830569
 ] 

ASF GitHub Bot commented on TIKA-4221:
--

tballison closed pull request #1685: TIKA-4221-temporary fix
URL: https://github.com/apache/tika/pull/1685




> Regression in unpack200 parsing in commons-compress
> ---
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).
> The stacktrace from 
> {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
>   looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception : 
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.DefaultParser@56a4479a
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>   at 
> org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
>   at 
> org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
>   at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
>   at 
> org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:412)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:389)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:49)
>   at 
> org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:389)
>   at 
> org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:329)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   ... 85 more





[PR] TIKA-4221-temporary fix [tika]

2024-03-25 Thread via GitHub


tballison opened a new pull request, #1685:
URL: https://github.com/apache/tika/pull/1685

   
   
   





[PR] TIKA-4221 -- temporary workaround [tika]

2024-03-25 Thread via GitHub


tballison opened a new pull request, #1686:
URL: https://github.com/apache/tika/pull/1686

   
   
   





Re: [PR] TIKA-4221-temporary fix [tika]

2024-03-25 Thread via GitHub


tballison closed pull request #1685: TIKA-4221-temporary fix
URL: https://github.com/apache/tika/pull/1685





[jira] [Commented] (TIKA-4221) Regression in unpack200 parsing in commons-compress

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830568#comment-17830568
 ] 

ASF GitHub Bot commented on TIKA-4221:
--

tballison opened a new pull request, #1685:
URL: https://github.com/apache/tika/pull/1685

   
   
   




> Regression in unpack200 parsing in commons-compress
> ---
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>

[jira] [Updated] (TIKA-4221) Regression in unpack200 parsing in commons-compress

2024-03-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4221:
--
Priority: Major  (was: Blocker)

> Regression in unpack200 parsing in commons-compress
> ---
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4221) Regression in unpack200 parsing in commons-compress

2024-03-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4221:
--
Priority: Blocker  (was: Minor)

> Regression in unpack200 parsing in commons-compress
> ---
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Blocker
>
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).
> The stacktrace from 
> {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
>   looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception : 
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.DefaultParser@56a4479a
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>   at 
> org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
>   at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
>   at 
> org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
>   at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
>   at 
> org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at java.io.FilterInputStream.available(FilterInputStream.java:168)
>   at 
> org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:412)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:389)
>   at 
> org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:49)
>   at 
> org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:389)
>   at 
> org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:329)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>   ... 85 more





[jira] [Commented] (TIKA-4221) Regression in unpack200 parsing in commons-compress

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830551#comment-17830551
 ] 

Tim Allison commented on TIKA-4221:
---

This is caused by a modification of unpack200's Archive class. In 
commons-compress 1.25.0, the input stream was wrapped in a 
CloseShieldInputStream and then not closed. Starting in 1.26.0, there's code 
that unwraps FilterInputStreams to get down to the source stream. Because 
CloseShieldInputStream is itself a FilterInputStream, the unwrapping defeats 
the shield, and the underlying stream is closed.

See: 
https://github.com/apache/commons-compress/blob/68cd2e7fb488b4ad8a9fdc81cae97ae6e8248ea5/src/main/java/org/apache/commons/compress/harmony/unpack200/Pack200UnpackerAdapter.java#L66

This only causes problems when an unpack200 file is embedded in another file 
with an ArchiveInputStream, which is why it is happening so rarely in our 
corpus.

That said, this is less than ideal.

We can probably work around this by writing our own CloseShieldInputStream that 
doesn't extend FilterInputStream. 
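
A minimal sketch of that workaround, assuming nothing beyond the JDK. The class 
name NonFilterCloseShieldInputStream is hypothetical and this is not the code 
that went into Tika. The idea is only that the wrapper extends InputStream 
directly and keeps its delegate in a private field, so code that peels off 
FilterInputStream layers has nothing to unwrap:

```java
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical close shield that extends InputStream directly instead of
 * FilterInputStream, so unwrapping logic that looks for FilterInputStream
 * instances stops at this wrapper.
 */
final class NonFilterCloseShieldInputStream extends InputStream {

    // Private on purpose: this is not FilterInputStream's protected "in" field.
    private final InputStream delegate;
    private boolean closed;

    NonFilterCloseShieldInputStream(InputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public int read() throws IOException {
        return closed ? -1 : delegate.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return closed ? -1 : delegate.read(b, off, len);
    }

    @Override
    public long skip(long n) throws IOException {
        return closed ? 0 : delegate.skip(n);
    }

    @Override
    public int available() throws IOException {
        return closed ? 0 : delegate.available();
    }

    @Override
    public void close() {
        // Mark only the wrapper as closed; the delegate is intentionally left open.
        closed = true;
    }
}
```

After close(), the wrapper reports end of stream, while the enclosing 
ArchiveInputStream can keep reading from the delegate.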

> Regression in unpack200 parsing in commons-compress
> ---
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).

[jira] [Updated] (TIKA-4221) Regression in unpack200 parsing in commons-compress

2024-03-25 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4221:
--
Summary: Regression in unpack200 parsing in commons-compress  (was: 
Regression in xz parsing in commons-compress)

> Regression in unpack200 parsing in commons-compress
> ---
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> We noticed ~10 xz files with fewer attachments in the recent regression tests 
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem, 
> but not a blocker (IMHO).





[jira] [Commented] (TIKA-4219) Figure out what to do with epubs with encrypted non-core content

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830493#comment-17830493
 ] 

Tim Allison commented on TIKA-4219:
---

This fix tries to extract all content.

In "regular" non-streaming handling, if a content file is encrypted, this 
throws an EncryptedDocumentException immediately. If a non-content resource is 
encrypted, this throws an EncryptedDocumentException after extracting all the 
content.

In streaming mode, this throws an EncryptedDocumentException for anything that 
is encrypted.
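
The policy described above can be sketched roughly as follows. This is an 
illustration only, not the code in the pull request: EpubItem, 
EpubEncryptionPolicy and EncryptedItemException are hypothetical stand-ins 
(the last one for Tika's EncryptedDocumentException), and the sketch assumes 
that streaming mode throws as soon as it meets an encrypted item.

```java
import java.util.List;

/** Stand-in for Tika's EncryptedDocumentException so the sketch compiles alone. */
class EncryptedItemException extends Exception {
    EncryptedItemException(String message) {
        super(message);
    }
}

/** Hypothetical view of one entry in the epub package. */
final class EpubItem {
    final String name;
    final boolean content;   // true for text-bearing files, false for fonts and similar
    final boolean encrypted;
    final String text;       // extracted text, or null when there is none

    EpubItem(String name, boolean content, boolean encrypted, String text) {
        this.name = name;
        this.content = content;
        this.encrypted = encrypted;
        this.text = text;
    }
}

final class EpubEncryptionPolicy {

    /**
     * Appends the text of readable content items to {@code out}.
     * Non-streaming: an encrypted content item throws immediately, while an
     * encrypted non-content item is remembered and reported after the loop.
     * Streaming: any encrypted item throws immediately.
     */
    static void extract(List<EpubItem> items, boolean streaming, List<String> out)
            throws EncryptedItemException {
        String deferred = null;
        for (EpubItem item : items) {
            if (item.encrypted) {
                if (streaming || item.content) {
                    throw new EncryptedItemException(item.name);
                }
                if (deferred == null) {
                    deferred = item.name; // report once all content has been extracted
                }
                continue;
            }
            if (item.content) {
                out.add(item.text);
            }
        }
        if (deferred != null) {
            throw new EncryptedItemException(deferred);
        }
    }
}
```

With an encrypted font sitting between two chapters, the non-streaming path 
still collects both chapters before the exception is thrown.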

The triggering file also showed that we should strip out qnames in our 
handlers. It is possible that xml: namespaces can creep into attributes or 
qnames.
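
One way to make a SAX handler indifferent to prefixes, sketched with the JDK's 
own parser. LocalNameCollector and stripPrefix are hypothetical names, and this 
is not necessarily how the Tika handlers were changed. The helper prefers the 
local name and falls back to cutting the prefix off the qname when the parser 
is not namespace aware:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that records element names without any "prefix:" part. */
final class LocalNameCollector extends DefaultHandler {

    final List<String> elements = new ArrayList<>();

    /** Uses the local name when the parser supplies one, else strips the prefix from the qname. */
    static String stripPrefix(String localName, String qName) {
        if (localName != null && !localName.isEmpty()) {
            return localName;
        }
        int colon = qName.indexOf(':');
        return colon < 0 ? qName : qName.substring(colon + 1);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements.add(stripPrefix(localName, qName));
    }

    /** Parses the XML string and returns the prefix-free element names in document order. */
    static List<String> parse(String xml, boolean namespaceAware) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(namespaceAware);
        LocalNameCollector handler = new LocalNameCollector();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.elements;
    }
}
```

The same helper can be applied to attribute names such as xml:lang, so the 
handler sees the same names whether or not the parser is namespace aware.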

What was weird was that plain tika-app extracted all the content from this file 
in earlier versions (before the encryption "fix") because the handler created 
in plain tika-app is apparently not namespace aware (?), whereas the 
ToTextHandler is (?). So, we got the full content out of tika-app, but not 
tika-app -J.

This is now also fixed.

> Figure out what to do with epubs with encrypted non-core content
> 
>
> Key: TIKA-4219
> URL: https://issues.apache.org/jira/browse/TIKA-4219
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> On TIKA-4218, we noticed several epubs that were now being identified as 
> encrypted, which is good. We did this work on TIKA-4176.
> On the other hand, we found several epubs that were now identified as 
> encrypted but which had content before we were doing the encryption detection.
> The issue in at least one file that I reviewed is that non-core content is 
> encrypted -- the fonts. So, from a text+metadata extraction, we could still 
> get all the content and then throw an Encrypted Exception or maybe flag 
> something as encrypted.
> I'm not sure what the best thing to do is in this case.
> An example file is here: 
> http://corpora.tika.apache.org/base/docs/commoncrawl3/47/47WOSBEUHE6CRMVDFBOOHUD36FEQAZ6T





Re: Tika chart cannot be reached

2024-03-25 Thread Tim Allison
Looks like it is back up?

https://apache.jfrog.io/ui/native/tika/tika/

Also looks like we never pushed 2.9.1. We'll make sure to push 2.9.2 when
that is ready.

On Mon, Mar 25, 2024 at 9:04 AM Francesco Scuccimarri <
francesco.scuccima...@maggioli.it> wrote:

> Hi Team Dev Tika,
> Over the past few days, I've encountered an issue while trying to use
> tika-helm. When I attempt to add the
> repository for Tika charts using the Helm command, I receive the following
> error message:
>
> *Looks like 'https://apache.jfrog.io/artifactory/tika/' is not a valid chart
> repository or cannot be reached.*
>
> It seems that the issue is specific to the Tika chart repository.
> Do you have any updates regarding any changes to the Tika chart repository
> or its accessibility? I've reviewed the documentation and searched online,
> but I haven't found any recent information about this issue.
>
> Thank you very much for your support.
>
> Best regards,
> Francesco Scuccimarri
>


[jira] [Commented] (TIKA-4219) Figure out what to do with epubs with encrypted non-core content

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830490#comment-17830490
 ] 

ASF GitHub Bot commented on TIKA-4219:
--

tballison opened a new pull request, #1684:
URL: https://github.com/apache/tika/pull/1684

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to the 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   






[PR] TIKA-4219 -- improve epub handling of encrypted non-text-containing items [tika]

2024-03-25 Thread via GitHub


tballison opened a new pull request, #1684:
URL: https://github.com/apache/tika/pull/1684

   
   
   





Tika chart cannot be reached

2024-03-25 Thread Francesco Scuccimarri
Hi Team Dev Tika,
Over the past few days, I've encountered an issue while trying to use
tika-helm. When I attempt to add the
repository for Tika charts using the Helm command, I receive the following
error message:

*Looks like 'https://apache.jfrog.io/artifactory/tika/' is not a valid chart
repository or cannot be reached.*

It seems that the issue is specific to the Tika chart repository.
Do you have any updates regarding any changes to the Tika chart repository
or its accessibility? I've reviewed the documentation and searched online,
but I haven't found any recent information about this issue.

Thank you very much for your support.

Best regards,
Francesco Scuccimarri


Re: apache tika helm repo down

2024-03-25 Thread Tim Allison
Thank you! This looks like a general ASF outage?

https://github.com/apache/arrow/issues/40744

On Fri, Mar 22, 2024 at 12:45 PM Piero Susca wrote:

> Error: looks like "https://apache.jfrog.io/artifactory/tika" is not a valid
> chart repository or cannot be reached: error converting YAML to JSON: yaml:
> line 22: mapping values are not allowed in this context
>


Re: Artifactory TIKA Helm Chart not reachable

2024-03-25 Thread toni . tauro
Hi 

I just saw that the whole of apache.jfrog.io is down.

https://github.com/apache/arrow/issues/40744

So please ignore my mail. 

Thank you for your work!

Toni


-- 

Adfinis AG
Antonio Tauro, System Engineer, GPG KeyID: 0x0796132F0077A5F8
Güterstrasse 86 | CH-4053 Basel
Office +41 61 500 31 31 | Direct +41 61 500 31 37
www.adfinis.com
On 03/25 09:34, toni.ta...@adfinis.com wrote:
> Hi
> 
> Starting this weekend, our pipelines are not working anymore due to the
> Apache Tika chart not being available anymore at 
> 
> https://apache.jfrog.io/artifactory/tika
> 
> The Page redirects directly to a landing page, no index.yaml is available.
> 
> ~ curl https://apache.jfrog.io/artifactory/tika/index.yaml -I
> HTTP/1.1 302 Moved Temporarily
> Date: Mon, 25 Mar 2024 08:33:26 GMT
> Content-Type: text/html
> Content-Length: 138
> Connection: keep-alive
> Location: https://landing.jfrog.com/reactivate-server/apache
> 
> Can you please take a look at it?
> 
> BR -- Toni
> 
> -- 
> 
> Adfinis AG
> Antonio Tauro, System Engineer, GPG KeyID: 0x0796132F0077A5F8
> Güterstrasse 86 | CH-4053 Basel
> Office +41 61 500 31 31 | Direct +41 61 500 31 37
> www.adfinis.com




Artifactory TIKA Helm Chart not reachable

2024-03-25 Thread toni . tauro
Hi

Starting this weekend, our pipelines are not working anymore due to the
Apache Tika chart not being available anymore at 

https://apache.jfrog.io/artifactory/tika

The Page redirects directly to a landing page, no index.yaml is available.

~ curl https://apache.jfrog.io/artifactory/tika/index.yaml -I
HTTP/1.1 302 Moved Temporarily
Date: Mon, 25 Mar 2024 08:33:26 GMT
Content-Type: text/html
Content-Length: 138
Connection: keep-alive
Location: https://landing.jfrog.com/reactivate-server/apache

Can you please take a look at it?

BR -- Toni

-- 

Adfinis AG
Antonio Tauro, System Engineer, GPG KeyID: 0x0796132F0077A5F8
Güterstrasse 86 | CH-4053 Basel
Office +41 61 500 31 31 | Direct +41 61 500 31 37
www.adfinis.com




Re: [PR] Bump org.ow2.asm:asm from 9.6 to 9.7 [tika]

2024-03-25 Thread via GitHub


THausherr merged PR #1680:
URL: https://github.com/apache/tika/pull/1680





Re: [PR] Bump de.thetaphi:forbiddenapis from 3.6 to 3.7 [tika]

2024-03-25 Thread via GitHub


THausherr merged PR #1681:
URL: https://github.com/apache/tika/pull/1681

