[jira] [Commented] (TIKA-4064) Update to 2.8.1

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731281#comment-17731281
 ] 

Hudson commented on TIKA-4064:
--

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk11 #1105 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1105/])
TIKA-4064: update build plugins (tilman: 
[https://github.com/apache/tika/commit/f2122dbcf2a8426d141e68591ad47730abfc160a])
* (edit) tika-parent/pom.xml
* (edit) tika-core/pom.xml


> Update to 2.8.1
> ---
>
> Key: TIKA-4064
> URL: https://issues.apache.org/jira/browse/TIKA-4064
> Project: Tika
>  Issue Type: Task
>  Components: build
>Affects Versions: 2.8.0
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 2.8.1
>
>
> The latest maven versions plugin finds much more outdated stuff than the 
> previous one, so I'll do a few updates.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4064) Update to 2.8.1

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731266#comment-17731266
 ] 

Hudson commented on TIKA-4064:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1104 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1104/])
TIKA-4064: update build plugins, cxf, aws (tilman: 
[https://github.com/apache/tika/commit/57d29fb6633a3c65fd40a29b93287f4d4695727d])
* (edit) tika-parent/pom.xml


> Update to 2.8.1
> ---
>
> Key: TIKA-4064
> URL: https://issues.apache.org/jira/browse/TIKA-4064
> Project: Tika
>  Issue Type: Task
>  Components: build
>Affects Versions: 2.8.0
>Reporter: Tilman Hausherr
>Priority: Minor
> Fix For: 2.8.1
>
>
> The latest maven versions plugin finds much more outdated stuff than the 
> previous one, so I'll do a few updates.





[jira] [Commented] (TIKA-4061) Incorrect Automatic-Module-Name in tika-parser-crypto-module

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731242#comment-17731242
 ] 

Hudson commented on TIKA-4061:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4061 -- incorrect automatic module name in crypto parser module (tallison: 
[https://github.com/apache/tika/commit/710d972ee1278e347b02527269050df727ee7ce8])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-crypto-module/pom.xml


> Incorrect Automatic-Module-Name in tika-parser-crypto-module
> 
>
> Key: TIKA-4061
> URL: https://issues.apache.org/jira/browse/TIKA-4061
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Jerome Isaac Haltom
>Assignee: Tim Allison
>Priority: Blocker
> Fix For: 2.8.1
>
>
> The Automatic-Module-Name property for tika-parser-crypto-module.jar in 
> MANIFEST.MF is set to org.apache.tika.parser.code. This is the incorrect 
> value.
> This currently blocks usage of Tika's Maven artifacts within IKVM projects. It 
> probably has ramifications for JDK9+ projects using modules as well, but 
> that's not me, so I don't know.
> [https://github.com/ikvmnet/ikvm-maven/issues/33]





[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731249#comment-17731249
 ] 

Hudson commented on TIKA-4060:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4060 Test AAC files, based on testWAV.wav, one without ID3, one with dummy 
ID3 values (nick: 
[https://github.com/apache/tika/commit/500900d67ede02e87440caa9f67501d3fe59b770])
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAACid3.aac
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testAAC.aac


> Add magic to audio/aac in tika-mimetypes.xml
> 
>
> Key: TIKA-4060
> URL: https://issues.apache.org/jira/browse/TIKA-4060
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Gregory Lepore
>Priority: Minor
> Fix For: 2.8.1
>
> Attachments: 
> 067aece423d8694a891a61a45ac0e870914bc1314ef510ac40b36ca3397843ef, 
> cb1bec08898db7a733b42ac44bdd76b6177cd3a07a2435a83fd99b7453d564d1
>
>
> Currently tika-mimetypes only recognizes audio/aac files by the file 
> extension. PRONOM recently added support for identifying aac files, but the 
> signature is tricky. There are two signatures, given below in PRONOM format. 
> Curly braces give the range of bytes that may be skipped before the 
> subsequent pattern.
>  
> The first pattern is pretty basic; the second is the first pattern preceded 
> by an ID3 header of up to 2048 bytes.
>  
> ||Name|Audio Data Transport Stream sig.1|
> ||Description|An FF pattern from BOF with variation of byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
> ||Name|Audio Data Transport Stream sig.2|
> ||Description|ID3 tag variation with variable byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|494433\{0-2045}FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
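
A minimal Java sketch of how the two signatures above could be checked by hand. It is an illustration only, not Tika's detection code: the class and method names are hypothetical, and for brevity it tests only the first two bytes of the pattern (FF followed by F0, F1, F8 or F9), ignoring the long third- and fourth-byte alternations.

```java
public class AdtsMagicSketch {

    // True if an ADTS-style sync pattern (FF, then F0|F1|F8|F9) starts at off.
    // Simplification: the third and fourth bytes of the PRONOM value are not checked.
    static boolean isSyncAt(byte[] data, int off) {
        if (off < 0 || off + 1 >= data.length) {
            return false;
        }
        int b0 = data[off] & 0xFF;
        int b1 = data[off + 1] & 0xFF;
        return b0 == 0xFF && (b1 == 0xF0 || b1 == 0xF1 || b1 == 0xF8 || b1 == 0xF9);
    }

    // sig.1: sync pattern at offset 0.
    // sig.2: "ID3" (49 44 33), a gap of 0 to 2045 bytes, then the sync pattern.
    static boolean looksLikeAdts(byte[] data) {
        if (isSyncAt(data, 0)) {
            return true;
        }
        boolean hasId3 = data.length >= 3
                && data[0] == 'I' && data[1] == 'D' && data[2] == '3';
        if (!hasId3) {
            return false;
        }
        // The gap {0-2045} puts the sync pattern at offsets 3 through 2048.
        int lastOffset = Math.min(3 + 2045, data.length - 2);
        for (int off = 3; off <= lastOffset; off++) {
            if (isSyncAt(data, off)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] plain = {(byte) 0xFF, (byte) 0xF1, 0x50, (byte) 0x80};
        System.out.println(looksLikeAdts(plain));
    }
}
```

Checking only two bytes would produce more false positives than the full signature, which is presumably why PRONOM spells out the remaining alternations.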





[jira] [Commented] (TIKA-4003) application/vnd.isac.fcs

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731254#comment-17731254
 ] 

Hudson commented on TIKA-4003:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4003 (#1150) (github: 
[https://github.com/apache/tika/commit/487f694938b99a507ea57349e3db084e6c25414b])
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/OneOffMimeTest.java
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
TIKA-4003 -- add extra spaces to application/vnd.isac.fcs (tallison: 
[https://github.com/apache/tika/commit/daad9eba7ef37d570d0ee12685c7a86a687f029a])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> application/vnd.isac.fcs
> 
>
> Key: TIKA-4003
> URL: https://issues.apache.org/jira/browse/TIKA-4003
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.8.1
>
> Attachments: 3215apc_14.fcs, 
> BD-FACS_Aria_II-Compensation_Controls_B515_Stained_Control.fcs, 
> Beckman_Coulter-Cyan.fcs
>
>






[jira] [Commented] (TIKA-3056) General upgrades for 1.24

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731245#comment-17731245
 ] 

Hudson commented on TIKA-3056:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-3056 -- add magic for ms-fontobject (tallison: 
[https://github.com/apache/tika/commit/0f8ea6183f3eead20d60c9f9140680d6ad8bec6e])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> General upgrades for 1.24
> -
>
> Key: TIKA-3056
> URL: https://issues.apache.org/jira/browse/TIKA-3056
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>






[jira] [Commented] (TIKA-4048) Gzipped WARC not identifying all assets

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731253#comment-17731253
 ] 

Hudson commented on TIKA-4048:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4048 -- change default decompressConcatenated to true in CompressorParser 
(#1166) (github: 
[https://github.com/apache/tika/commit/1f41ead892b49606c8bc43c97b48d6a05af4becd])
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/test/resources/test-documents/multiple.gz
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/pkg/GzipParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/CompressorParser.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/test/resources/org/apache/tika/parser/pkg/tika-gzip-config.xml
* (edit) CHANGES.txt


> Gzipped WARC not identifying all assets
> ---
>
> Key: TIKA-4048
> URL: https://issues.apache.org/jira/browse/TIKA-4048
> Project: Tika
>  Issue Type: Bug
>Reporter: Gregory Lepore
>Priority: Minor
> Fix For: 2.8.1
>
> Attachments: Screenshot 2023-05-30 at 3.49.19 PM.png, Screenshot 
> 2023-05-30 at 3.50.41 PM.png, rec-20230518121844489398-5335604b8b23.warc, 
> rec-20230518121844489398-5335604b8b23.warc.gz, 
> rec-20230518121844489398-5335604b8b23.warc.gz.json, 
> rec-20230518121844489398-5335604b8b23.warc.json
>
>
> The WARC parser works for non GZipped WARC files, but for GZipped WARC files 
> it appears not all embedded files are being identified.
>  
> Processing a WARC.GZ file should return the same JSON output as the plain 
> WARC file, with the addition of the GZ file metadata. However, in the 
> attached JSON outputs, the JPEG present in the plain WARC file is not 
> represented in the WARC.GZ.json file.
>  
> Additionally, the warc: metadata is not being returned for all files, 
> although this may be by design. 
>  
> Attached are two JSON files, one for the GZipped WARC file and one for the 
> plain WARC file, along with the two original files.
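
The commit above changes the default of decompressConcatenated to true. The following Java sketch illustrates what a concatenated (multi-member) gzip stream is, which is how gzipped WARC files are commonly laid out. It uses only the JDK's java.util.zip classes as a stand-in, not Tika's CompressorParser, and the class and method names are hypothetical. The JDK's GZIPInputStream continues into following members, so the second record is not lost; a reader that stops after the first member would return only the first record.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ConcatenatedGzipSketch {

    // Compress one string into a single, complete gzip member.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Append two gzip members back to back, as in a multi-member .gz file.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    // Decompress every member in the stream.
    static String gunzipAll(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] twoMembers = concat(gzip("record-1;"), gzip("record-2;"));
        System.out.println(gunzipAll(twoMembers));
    }
}
```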





[jira] [Commented] (TIKA-4055) Write limit not working correctly in RecursiveParserWrapper

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731246#comment-17731246
 ] 

Hudson commented on TIKA-4055:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4055 -- fix bug in writelimit checks in RecursiveParserWrapper and a 
separate bug in /rmeta (#1156) (github: 
[https://github.com/apache/tika/commit/f41d8c35a78e845fc1adf548e8eea3df5463a63b])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
* (edit) CHANGES.txt
* (edit) 
tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/log4j.properties
* (edit) 
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/RecursiveMetadataResource.java
* (edit) 
tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/RecursiveMetadataResourceTest.java


> Write limit not working correctly in RecursiveParserWrapper
> ---
>
> Key: TIKA-4055
> URL: https://issues.apache.org/jira/browse/TIKA-4055
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.8.1
>
>
> [~g...@rhobard.com] noticed that the write limit in the 
> RecursiveParserWrapper is not working correctly.  I can confirm this is a 
> bug.  I'm working on a fix now.





[jira] [Commented] (TIKA-4005) application/x-endnote-style

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731251#comment-17731251
 ] 

Hudson commented on TIKA-4005:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4005 (#1149) (github: 
[https://github.com/apache/tika/commit/223ec8e47efdae5748d6377491ddb24c2feade67])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> application/x-endnote-style
> ---
>
> Key: TIKA-4005
> URL: https://issues.apache.org/jira/browse/TIKA-4005
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.8.1
>
>






[jira] [Commented] (TIKA-4004) font/otf application/vnd.ms-opentype

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731252#comment-17731252
 ] 

Hudson commented on TIKA-4004:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4004 -- add magic for application/x-font-otf (tallison: 
[https://github.com/apache/tika/commit/8f8c9f9190df54fa843cf7dd5cdc34a3c87496ce])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> font/otf application/vnd.ms-opentype
> 
>
> Key: TIKA-4004
> URL: https://issues.apache.org/jira/browse/TIKA-4004
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.8.1
>
> Attachments: 00.warc, aller-bold.eot, aller-light.eot, 
> fleurons.eot, index.html_id=45_and_type=eot, index.html_id=67_and_type=eot, 
> index.html_id=75_and_type=eot, index.html_id=77_and_type=eot, 
> index.html_id=80_and_type=eot, index.html_id=83_and_type=eot, 
> index.html_id=84_and_type=eot
>
>






[jira] [Commented] (TIKA-4002) application/vnd.tcpdump.pcapng

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731244#comment-17731244
 ] 

Hudson commented on TIKA-4002:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4002 -- add mime type detection for pcapng (#1152) (github: 
[https://github.com/apache/tika/commit/b0080e7df9cc4dda9a01a5fac6631c74a0e2a97a])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/OneOffMimeTest.java


> application/vnd.tcpdump.pcapng
> --
>
> Key: TIKA-4002
> URL: https://issues.apache.org/jira/browse/TIKA-4002
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.8.1
>
> Attachments: fmt_779_pcap_Packet_Capture_small_capture.pcap
>
>






[jira] [Commented] (TIKA-4046) Bump siegfried detector timeout to one minute

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731243#comment-17731243
 ] 

Hudson commented on TIKA-4046:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4046 -- bump siegfried timeout to 1 minute. (tallison: 
[https://github.com/apache/tika/commit/8877e9fc7ab2eb004ff7b5390aa281a7357a6eb1])
* (edit) 
tika-detectors/tika-detector-siegfried/src/main/java/org/apache/tika/detect/siegfried/SiegfriedDetector.java


> Bump siegfried detector timeout to one minute
> -
>
> Key: TIKA-4046
> URL: https://issues.apache.org/jira/browse/TIKA-4046
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 2.8.1
>
>
> It looks like I set the siegfried timeout to 6000 milliseconds.  I'm sure 
> that's a typo for 60000.  Let's bump it to a minute.
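
A small Java sketch of the arithmetic behind this ticket: 6000 milliseconds is six seconds, while one minute is 60000 milliseconds. The class and constant names are hypothetical and do not come from SiegfriedDetector. Deriving the value from TimeUnit rather than typing the literal is one way to avoid this kind of slip.

```java
import java.util.concurrent.TimeUnit;

public class TimeoutSketch {

    // Hypothetical names; the old literal amounted to six seconds.
    static final long OLD_TIMEOUT_MS = 6000;

    // One minute, computed rather than typed as a literal.
    static final long NEW_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(1);

    public static void main(String[] args) {
        System.out.println(TimeUnit.MILLISECONDS.toSeconds(OLD_TIMEOUT_MS) + " s -> "
                + TimeUnit.MILLISECONDS.toSeconds(NEW_TIMEOUT_MS) + " s");
    }
}
```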





[jira] [Commented] (TIKA-4063) PipesServer should not initialize emitters if the server will never emit results

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731240#comment-17731240
 ] 

Hudson commented on TIKA-4063:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4063 -- skip initialization of emitter in PipesServer if emitting from the 
server has been turned off. (tallison: 
[https://github.com/apache/tika/commit/1da3b76dee4aef19f0019eea0210a58fbaabcff2])
* (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java


> PipesServer should not initialize emitters if the server will never emit 
> results
> 
>
> Key: TIKA-4063
> URL: https://issues.apache.org/jira/browse/TIKA-4063
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 2.8.1
>
>
> As a safety valve for large extracts, we enabled direct emitting of data from 
> the PipesServer, without passing the data back to the PipesClient to be 
> emitted by the main process.
> If a user has disabled emitting from the PipesServer, we should not 
> initialize the emitters in the PipesServer.
> I ran into this recently because sqlite does not like multiple processes 
> interacting with the same db afaict.  





[jira] [Commented] (TIKA-4039) Allow users to set the maximum attachment size in the /unpack resource of tika-server

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731250#comment-17731250
 ] 

Hudson commented on TIKA-4039:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4039 (#1181) (github: 
[https://github.com/apache/tika/commit/2d9daef859296cad877caf29ad7765c0709472d0])
* (edit) CHANGES.txt
* (edit) 
tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/UnpackerResourceTest.java
* (edit) 
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/UnpackerResource.java


> Allow users to set the maximum attachment size in the /unpack resource of 
> tika-server
> -
>
> Key: TIKA-4039
> URL: https://issues.apache.org/jira/browse/TIKA-4039
> Project: Tika
>  Issue Type: Improvement
>  Components: config, parser
>Affects Versions: 2.7.0
>Reporter: Shay barak
>Assignee: Tim Allison
>Priority: Blocker
> Fix For: 2.8.1
>
> Attachments: tika-config.xml
>
>
> Add an option to override the maximum number of bytes that the Unrar parser 
> can handle, so that I do not get a TikaMemoryLimitException.
> I would like the configuration to look like this:
> 
>             
>                  type="int">10
>             
> 





[jira] [Commented] (TIKA-4054) Add various file identifications to reduce application/octet-stream

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731248#comment-17731248
 ] 

Hudson commented on TIKA-4054:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4054 -- add a bunch of mimes via Greg Lepore (#1158) (github: 
[https://github.com/apache/tika/commit/4edff73f0fe3da1df0ba8d8c5a367fbd35b2af34])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/OneOffMimeTest.java
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> Add various file identifications to reduce application/octet-stream
> ---
>
> Key: TIKA-4054
> URL: https://issues.apache.org/jira/browse/TIKA-4054
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Gregory Lepore
>Priority: Major
> Fix For: 2.8.1
>
>
> Catch all task for various format identification data which are currently 
> being identified as application/octet-stream. Most data is from PRONOM.
>  
> SPSS Data File
> application/x-spss-sav
> ||External signatures|File extension: sav|
> ||Internal signatures||
> ||Name|SPSS Data File|
> ||Description|BOF: $FL2@(#)|
> ||Byte sequences||
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|24464C3240282329|
>  
> Amiga Disk File
> application/x-amiga-disk-format
> ||External signatures|File extension: adf|
> ||Internal signatures||
> ||Name|Amiga Disk File|
> ||Description|BOF: ‘DOS’ followed by ‘00\|01\|02\|03\|04\|05\|06\|07’ 
> depending on the format of the disk. More information on the internal 
> signature can be found here: [http://lclevy.free.fr/adflib/adf_info.html#p41]|
> ||Byte sequences||
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|444F53(00\|01\|02\|03\|04\|05\|06\|07)|
>  
> JEOL NMR Spectroscopy
> chemical/x-jeol-jdf
> ||External signatures|File extension: jdf|
> ||Internal signatures| |
> ||Name|JDF NMR Spectroscopy big endian|
> ||Description|Big Endian: BOF: 4A454F4C2E4E4D52 (JEOL.NMR)|
> ||Byte sequences||
>  
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|4A454F4C2E4E4D52|
> | | |
> ||Name|JDF little endian|
> ||Description|Little Endian: 524D4E2E4C4F454A (RMN.LOEJ)|
> ||Byte sequences| |
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|524D4E2E4C4F454A|
>  
> ASPRS Lidar Data Exchange Format
> no mimetype found
> ||External signatures|File extension: las
> File extension: laz|
> ||Internal signatures||
> ||Name|ASPRS Lidar Data Exchange Format 1.2|
> ||Description|ASCII header: LASF, followed after 20 bytes by version number 
> 1.2|
> ||Byte sequences||
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Byte order| |
> ||Value|4C415346\{20}0102\{78}[00:99]|
>  
> ASPRS Lidar Data Exchange Format v1.1
> no mimetype found
> ||External signatures|File extension: las
> File extension: laz|
> ||Internal signatures||
> ||Name|ASPRS Lidar Data Exchange Format 1.1|
> ||Description|ASCII header: LASF, followed after 20 bytes by version number 
> 1.1|
> ||Byte sequences||
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Byte order| |
> ||Value|4C415346\{20}0101\{78}[00:99]|
>  
> 3D Studio
> image/x-3ds
> ||External signatures|File extension: 3ds|
> ||Internal signatures||
> ||Name|3D Studio (V1)|
> ||Description|Primary chunk ID, chunk length, version subchunk ID, chunk 
> length, version, 3D-editor chunk ID.|
> ||Byte sequences||
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Byte order|Little-endian|
> ||Value|4D4D\{4}02000A00(03\|04)\{3}3D3D|
> ||Name|3D Studio (V2)|
> ||Description|Primary chunk ID, chunk length, 3D-editor chunk ID|
> ||Byte sequences||
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|4D4D\{4}3D3D|
>  
> TAP (ZX Spectrum)
> [application/x-spectrum-tzx|https://www.digipres.org/formats/mime-types/#application/x-spectrum-tzx]
> ||External signatures|File extension: tap|
> ||Internal signatures||
> ||Name|TAPZX|
> ||Description|…\{20}ÿ|
> ||Byte sequences||
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|13\{20}FF|
>  
> Sibelius
> no mimetype found
> ||External signatures|File extension: sib|
> ||Internal signatures||
> ||Name|Sibelius|
> ||Description|Absolute from beginning of file, magic bytes: .SIBELIUS|
> ||Byte sequences||
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|0F534942454C495553|
>  
> Portable Sound Format
> no mimetype found
> ||External signatur

[jira] [Commented] (TIKA-3996) audio/x-sap

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731255#comment-17731255
 ] 

Hudson commented on TIKA-3996:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-3996 (#1151) (github: 
[https://github.com/apache/tika/commit/7118705ef36463a4fd9836f2caedb87dbd5c6ef7])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/OneOffMimeTest.java


> audio/x-sap
> ---
>
> Key: TIKA-3996
> URL: https://issues.apache.org/jira/browse/TIKA-3996
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.8.1
>
> Attachments: airwolf.sap, ala_ma_kota.sap, alchemia.sap
>
>






[jira] [Commented] (TIKA-4000) application/vnd.msa-disk-image

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731247#comment-17731247
 ] 

Hudson commented on TIKA-4000:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4000 -- add detection for magic shadow archiver (tallison: 
[https://github.com/apache/tika/commit/78ce839bcad8d21afc2ce5de48e6d5f6caddfe03])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> application/vnd.msa-disk-image
> --
>
> Key: TIKA-4000
> URL: https://issues.apache.org/jira/browse/TIKA-4000
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.8.1
>
> Attachments: DREAMZ2B.MSA, SOTART2.MSA, TIKBGBB2.MSA
>
>






[jira] [Commented] (TIKA-3941) Consider having pipesserver return intermediate results

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731256#comment-17731256
 ] 

Hudson commented on TIKA-3941:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-3941 -- allow reporting of intermediate results from the pipes processor 
(#1167) (github: 
[https://github.com/apache/tika/commit/6cea7717c7a90014cd86fa605cc1e9125f173cf4])
* (edit) tika-core/src/main/java/org/apache/tika/pipes/async/AsyncConfig.java
* (edit) tika-core/src/test/java/org/apache/tika/pipes/async/MockReporter.java
* (edit) 
tika-core/src/test/java/org/apache/tika/pipes/async/AsyncProcessorTest.java
* (edit) tika-core/src/main/java/org/apache/tika/parser/AutoDetectParser.java
* (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesResult.java
* (edit) 
tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
* (edit) 
tika-pipes/tika-pipes-reporters/tika-pipes-reporter-jdbc/src/test/java/org/apache/tika/pipes/reporters/jdbc/TestJDBCPipesReporter.java
* (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java
* (add) tika-core/src/test/java/org/apache/tika/pipes/PipesServerTest.java
* (add) 
tika-core/src/test/java/org/apache/tika/pipes/async/MockDigesterFactory.java
* (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesClient.java
* (add) tika-core/src/test/resources/org/apache/tika/pipes/TIKA-3941.xml
* (edit) tika-core/src/main/java/org/apache/tika/pipes/async/AsyncProcessor.java


> Consider having pipesserver return intermediate results
> ---
>
> Key: TIKA-3941
> URL: https://issues.apache.org/jira/browse/TIKA-3941
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.8.1
>
>
> If the pipes server crashes, the only information that the pipesclient 
> receives is of the crash.  It would be useful at a minimum to have the pipes 
> server report an intermediate result after file detection. 
> Ideally, at a minimum, the pipesclient could report file type, content-length 
> (if possible) and digest information.
>  
> On another ticket (future work), we could extend intermediate results to 
> include partial parses/metadata extraction.  The challenge here is that the 
> underlying metadata objects are not thread safe...so we'll punt this to deal 
> with later if necessary.





[jira] [Commented] (TIKA-4062) OfflineContentHandler/ContentHandlerDecorator does not provide option for custom error handling

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731241#comment-17731241
 ] 

Hudson commented on TIKA-4062:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4062 (#1179) (github: 
[https://github.com/apache/tika/commit/ceed7be8b1bffd697a79590e50a413744a0b108f])
* (edit) 
tika-core/src/main/java/org/apache/tika/exception/WriteLimitReachedException.java
* (edit) 
tika-core/src/main/java/org/apache/tika/sax/ContentHandlerDecorator.java


> OfflineContentHandler/ContentHandlerDecorator does not provide option for 
> custom error handling
> ---
>
> Key: TIKA-4062
> URL: https://issues.apache.org/jira/browse/TIKA-4062
> Project: Tika
>  Issue Type: Bug
>  Components: tika-core
>Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0, 2.8.0
>Reporter: Ravi Ranjan Jha
>Priority: Critical
>
> OfflineContentHandler/ContentHandlerDecorator does not provide option for 
> custom error handling
> Before OfflineContentHandler was passed to the SAX parser in 
> XMLReaderUtils.parseSAX, one could pass a custom ContentHandlerDecorator to 
> handle exceptions or override the error/warning methods. That is no longer 
> possible because the default implementation of handleException in 
> OfflineContentHandler's parent, ContentHandlerDecorator, just rethrows the 
> exception, as shown below:
>  
>     protected void handleException(SAXException exception) throws SAXException {
>         throw exception;
>     }
>  
> which could probably be (at minimum)
>  
>     public void handleException(SAXException exception) throws SAXException {
>         handler.handleException(exception);
>     }
>  
> This is breaking our app's behavior. Please treat it as a priority.
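
A self-contained Java sketch of the pattern the reporter describes: a decorator whose handleException rethrows by default, and a subclass that overrides it to collect errors instead. The classes here are hypothetical stand-ins built on the JDK's org.xml.sax types. They are not Tika's ContentHandlerDecorator or OfflineContentHandler, and the failure condition is simulated.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HandleExceptionSketch {

    // Stand-in decorator: routes failures through handleException, which rethrows.
    static class RethrowingDecorator extends DefaultHandler {
        protected void handleException(SAXException exception) throws SAXException {
            throw exception;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                // Simulated failure so the hook has something to handle.
                if (length > 5) {
                    throw new SAXException("simulated failure: chunk too long");
                }
            } catch (SAXException e) {
                handleException(e);
            }
        }
    }

    // Custom error handling: record the message instead of rethrowing.
    static class CollectingDecorator extends RethrowingDecorator {
        final List<String> errors = new ArrayList<>();

        @Override
        protected void handleException(SAXException exception) {
            errors.add(exception.getMessage());
        }
    }

    // Feed text to the collecting decorator and return how many errors it recorded.
    static int collectedErrors(String text) {
        CollectingDecorator decorator = new CollectingDecorator();
        char[] chars = text.toCharArray();
        try {
            decorator.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException("override should have swallowed this", e);
        }
        return decorator.errors.size();
    }

    // True if the default (rethrowing) decorator propagates the failure.
    static boolean defaultRethrows(String text) {
        char[] chars = text.toCharArray();
        try {
            new RethrowingDecorator().characters(chars, 0, chars.length);
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(collectedErrors("a long chunk") + " " + defaultRethrows("a long chunk"));
    }
}
```

The override only helps if the caller can supply its own subclass, which is the capability the reporter says was lost.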





[jira] [Commented] (TIKA-4052) application/x-cdf

2023-06-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731239#comment-17731239
 ] 

Hudson commented on TIKA-4052:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1103 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1103/])
TIKA-4052 -- add detection for application/x-cdf (tallison: 
[https://github.com/apache/tika/commit/0f86aede1e2317b843a6f11ee702570c7d57737d])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> application/x-cdf
> -
>
> Key: TIKA-4052
> URL: https://issues.apache.org/jira/browse/TIKA-4052
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Gregory Lepore
>Priority: Major
> Fix For: 2.8.1
>
> Attachments: track05.cda, track06.cda, track07.cda
>
>
> Examining the Common Crawl files that return application/octet-stream.
>  
> application/x-cdf is one that should be fairly easy to add.
>  
> ||Name|CD Audio|
> ||Description|Files are 44 bytes in length, with header sequence ASCII: 
> RIFF$...CDDAfmt .|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Byte order| |
> ||Value|5249464624004341666D742018|
> |





[jira] [Created] (TIKA-4064) Update to 2.8.1

2023-06-10 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4064:
-

 Summary: Update to 2.8.1
 Key: TIKA-4064
 URL: https://issues.apache.org/jira/browse/TIKA-4064
 Project: Tika
  Issue Type: Task
  Components: build
Affects Versions: 2.8.0
Reporter: Tilman Hausherr
 Fix For: 2.8.1


The latest maven versions plugin finds much more outdated stuff than the 
previous one, so I'll do a few updates.





no tika builds for 29 days

2023-06-10 Thread Tilman Hausherr

There have been no tika builds for 29 days on the CI:



I tried to start it manually, but it failed, claiming that no Maven was 
available. I then opened and saved the configuration, and now it's running.


Tilman


[jira] [Commented] (TIKA-3941) Consider having pipesserver return intermediate results

2023-06-10 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731219#comment-17731219
 ] 

Tilman Hausherr commented on TIKA-3941:
---

{{PipesServerTest}} fails on Windows. Please change {{replaceAll}} to 
{{replace}}; with that change it works.
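
A likely reason, sketched in Java under the assumption that the test substitutes a file path into a template string (the placeholder name and path below are made up, not taken from PipesServerTest): String.replaceAll treats backslashes in the replacement as escape characters, so a Windows path loses its separators, whereas String.replace substitutes the text literally.

```java
public class ReplaceVsReplaceAllSketch {

    // Literal substitution: the path is inserted unchanged.
    static String viaReplace(String template, String path) {
        return template.replace("{PATH}", path);
    }

    // Regex substitution: backslashes in the replacement escape the next character.
    static String viaReplaceAll(String template, String path) {
        return template.replaceAll("\\{PATH\\}", path);
    }

    public static void main(String[] args) {
        String windowsPath = "C:\\temp\\x"; // the characters C:\temp\x
        System.out.println(viaReplace("dir={PATH}", windowsPath));    // dir=C:\temp\x
        System.out.println(viaReplaceAll("dir={PATH}", windowsPath)); // dir=C:tempx
    }
}
```

If regex matching is actually needed, java.util.regex.Matcher.quoteReplacement can be applied to the replacement string instead.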

> Consider having pipesserver return intermediate results
> ---
>
> Key: TIKA-3941
> URL: https://issues.apache.org/jira/browse/TIKA-3941
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.8.1
>
>
> If the pipes server crashes, the only information that the pipesclient 
> receives is of the crash.  It would be useful at a minimum to have the pipes 
> server report an intermediate result after file detection. 
> Ideally, at a minimum, the pipesclient could report file type, content-length 
> (if possible) and digest information.
>  
> On another ticket (future work), we could extend intermediate results to 
> include partial parses/metadata extraction.  The challenge here is that the 
> underlying metadata objects are not thread safe...so we'll punt this to deal 
> with later if necessary.


