[jira] [Commented] (TIKA-4062) OfflineContentHandler/ContentHandlerDecorator does not provide option for custom error handling

2023-08-02 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750344#comment-17750344
 ] 

Nick Burch commented on TIKA-4062:
--

Between holidays and the time needed for regression runs and votes, I 
suspect late August or early September.

> OfflineContentHandler/ContentHandlerDecorator does not provide option for 
> custom error handling
> ---
>
> Key: TIKA-4062
> URL: https://issues.apache.org/jira/browse/TIKA-4062
> Project: Tika
> Issue Type: Bug
> Components: tika-core
> Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0, 2.8.0
> Reporter: Ravi Ranjan Jha
> Priority: Critical
> Fix For: 2.8.1
>
>
> OfflineContentHandler/ContentHandlerDecorator does not provide option for 
> custom error handling
> Before XMLReaderUtils.parseSAX was changed to pass an OfflineContentHandler 
> to the SAX parser, one could pass a custom ContentHandlerDecorator to handle 
> exceptions or to override methods such as error and warning. This is no 
> longer possible, because the default implementation of handleException in 
> ContentHandlerDecorator, the parent of OfflineContentHandler, simply rethrows 
> the exception, as shown below:
>  
> protected void handleException(SAXException exception) throws SAXException {
>     throw exception;
> }
>  
> which could probably be (at minimum)
> public void handleException(SAXException exception) throws SAXException {
>     handler.handleException(exception);
> }
>  
> This is breaking our app's behavior. Please treat it as a priority.
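
For readers following TIKA-4062, here is a minimal sketch of the pattern under discussion. It uses only the JDK's SAX classes, not Tika. The names `ForwardingHandler`, `CollectingHandler`, and `HandlerDemo` are hypothetical and do not exist in Tika. The sketch only illustrates how a wrapper that forwards error callbacks to the wrapped handler, and routes failures through an overridable `handleException`, could let a caller's custom error handling survive the wrapping.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical wrapper (not a Tika class): forwards error callbacks to the
// wrapped handler and sends any failure through handleException.
class ForwardingHandler extends DefaultHandler {
    private final DefaultHandler delegate;

    ForwardingHandler(DefaultHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void warning(SAXParseException e) throws SAXException {
        try {
            delegate.warning(e);
        } catch (SAXException ex) {
            handleException(ex);
        }
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        try {
            delegate.error(e);
        } catch (SAXException ex) {
            handleException(ex);
        }
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        try {
            delegate.fatalError(e);
        } catch (SAXException ex) {
            handleException(ex);
        }
    }

    // Default behavior is to rethrow; subclasses may override to log or ignore.
    protected void handleException(SAXException e) throws SAXException {
        throw e;
    }
}

// Hypothetical caller-side handler that records problems instead of throwing.
class CollectingHandler extends DefaultHandler {
    final List<String> problems = new ArrayList<>();

    @Override
    public void error(SAXParseException e) {
        problems.add("error: " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) {
        problems.add("fatal: " + e.getMessage());
    }
}

class HandlerDemo {
    // Drives the callbacks by hand so the sketch does not depend on a parser.
    static List<String> run() {
        CollectingHandler custom = new CollectingHandler();
        DefaultHandler wrapped = new ForwardingHandler(custom);
        try {
            wrapped.error(new SAXParseException("bad attribute", null));
            wrapped.fatalError(new SAXParseException("unexpected end", null));
        } catch (SAXException e) {
            throw new IllegalStateException("custom handler should not throw", e);
        }
        return custom.problems;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because the wrapper delegates rather than replacing the error callbacks, the choice between failing fast and collecting problems stays with whoever supplies the inner handler. Whether Tika should adopt this shape is for the issue discussion to decide.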



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-1180) Matroska (mkv, mka, webm) Detector

2023-08-02 Thread Wladimir Leite (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750493#comment-17750493
 ] 

Wladimir Leite commented on TIKA-1180:
--

I ran a test with a set of ~5,000 files collected from many different sources, 
and only ~*4%* are correctly identified by the signatures defined in 
*tika-mimetypes.xml* (file extensions were hidden for the test).

Inspecting the content of these files and going through the format 
specification ([https://www.matroska.org/index.html]), I created a modified 
configuration (shown below) that seems to work better: *100%* of the WEBM and 
MKV files were correctly identified; MKA still relies on the file extension, 
but these audio files are extremely rare (while both video formats are widely 
used).

It would enhance the current configuration without requiring additional code 
or libraries. 

By the way, I tested the detector mentioned above 
([https://github.com/OmarAssadi/matroska-tika]). It worked fine, but it missed 
25 videos (~0.5%) that are correctly identified with the signatures described 
below. The detector also doesn't handle MKA files.

 
{code:java}
    
        
            
                
                
            
        
        
        


        
        
        


        
            
                
                
            
        
        
    {code}
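
As background on why signature-based detection can work here: Matroska and WebM files begin with an EBML header (element ID `1A 45 DF A3`) that contains a DocType element (ID `42 82`) whose value is the ASCII string `matroska` or `webm`. The following is a rough Java sketch of that check. It is not the configuration proposed in the comment above and not Tika code. The class name, the 64-byte scan window, and the restriction to one-byte element sizes are all assumptions made for illustration. A robust implementation would parse the EBML elements properly instead of scanning for the ID bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Tika): reads the EBML DocType from the
// first bytes of a file and maps it to a media type.
class EbmlDocTypeSketch {
    // Assumption: the DocType element appears within the first 64 bytes.
    private static final int SCAN_LIMIT = 64;

    /** Returns the DocType string, or null if the bytes are not recognized. */
    static String docType(byte[] head) {
        // EBML header element ID: 1A 45 DF A3
        if (head.length < 4
                || (head[0] & 0xFF) != 0x1A || (head[1] & 0xFF) != 0x45
                || (head[2] & 0xFF) != 0xDF || (head[3] & 0xFF) != 0xA3) {
            return null;
        }
        int limit = Math.min(head.length, SCAN_LIMIT);
        for (int i = 4; i + 2 < limit; i++) {
            // DocType element ID: 42 82
            if ((head[i] & 0xFF) == 0x42 && (head[i + 1] & 0xFF) == 0x82) {
                int size = head[i + 2] & 0xFF;
                if ((size & 0x80) == 0) {
                    continue; // only one-byte sizes are handled in this sketch
                }
                int len = size & 0x7F;
                int start = i + 3;
                if (start + len > head.length) {
                    return null;
                }
                return new String(head, start, len, StandardCharsets.US_ASCII);
            }
        }
        return null;
    }

    /** Maps the DocType to a media type; "matroska" cannot tell MKV from MKA. */
    static String mimeType(byte[] head) {
        String type = docType(head);
        if ("webm".equals(type)) {
            return "video/webm";
        }
        if ("matroska".equals(type)) {
            return "video/x-matroska";
        }
        return null;
    }

    /** Builds a minimal synthetic header, for demonstration only. */
    static byte[] sampleHeader(String docType) {
        byte[] name = docType.getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[8 + name.length];
        out[0] = 0x1A;
        out[1] = 0x45;
        out[2] = (byte) 0xDF;
        out[3] = (byte) 0xA3;
        out[4] = (byte) (0x80 | (3 + name.length)); // header payload size
        out[5] = 0x42;
        out[6] = (byte) 0x82;
        out[7] = (byte) (0x80 | name.length);       // DocType value size
        System.arraycopy(name, 0, out, 8, name.length);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(mimeType(sampleHeader("webm")));
        System.out.println(mimeType(sampleHeader("matroska")));
    }
}
```

The DocType alone does not separate audio-only Matroska from video, which is consistent with the remark above that MKA still relies on the file extension.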
 

 

 

> Matroska (mkv, mka, webm) Detector
> --
>
> Key: TIKA-1180
> URL: https://issues.apache.org/jira/browse/TIKA-1180
> Project: Tika
> Issue Type: New Feature
> Components: detector
> Affects Versions: 1.5
> Reporter: Nick Burch
> Priority: Major
> Labels: new-parser
>
> Following the work on TIKA-1177, we now have mimetype entries for the various 
> formats which are based on the Matroska container (mkv, mka, webm etc). 
> However, we are unable to properly identify the specific type just from some 
> mime magic.
> Instead, for fully accurate detection, we'll need a new Detector for the 
> Matroska family, which does some very simple container/stream processing to 
> work out what the container contains.





[jira] [Comment Edited] (TIKA-1180) Matroska (mkv, mka, webm) Detector

2023-08-02 Thread Wladimir Leite (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750493#comment-17750493
 ] 

Wladimir Leite edited comment on TIKA-1180 at 8/2/23 10:19 PM:
---

I ran a test with a set of ~4,400 files collected from many different sources, 
and only ~*4%* are correctly identified by the signatures defined in 
*tika-mimetypes.xml* (file extensions were hidden for the test).

Inspecting the content of these files and going through the format 
specification ([https://www.matroska.org/index.html]), I created a modified 
configuration (shown below) that seems to work better: *100%* of the WEBM and 
MKV files were correctly identified; MKA still relies on the file extension, 
but these audio files are extremely rare (while both video formats are widely 
used).

It would enhance the current configuration without requiring additional code 
or libraries. 

By the way, I tested the detector mentioned above 
([https://github.com/OmarAssadi/matroska-tika]). It worked fine, but it missed 
25 videos (~0.5%) that are correctly identified with the signatures described 
below. The detector also doesn't handle MKA files.

 
{code:java}
    
        
            
                
                
            
        
        
        


        
        
        


        
            
                
                
            
        
        
    {code}
 

Just as a reference, the discussion in our project about this: 
https://github.com/sepinf-inc/IPED/issues/1786

 








[jira] [Commented] (TIKA-1180) Matroska (mkv, mka, webm) Detector

2023-08-02 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750520#comment-17750520
 ] 

Luís Filipe Nassif commented on TIKA-1180:
--

Great, and thank you [~tc-wleite]! Among your test files, would it be possible 
to share one small MKV and one small WEBM, free of sensitive info, that 
weren't detected properly before and are detected by your custom signatures, so 
we can write a proper unit test to avoid future regressions?






[GitHub] [tika] dependabot[bot] opened a new pull request, #1268: Bump aws.version from 1.12.520 to 1.12.521

2023-08-02 Thread via GitHub


dependabot[bot] opened a new pull request, #1268:
URL: https://github.com/apache/tika/pull/1268

   Bumps `aws.version` from 1.12.520 to 1.12.521.
   Updates `com.amazonaws:aws-java-sdk-s3` from 1.12.520 to 1.12.521
   
   Changelog
   Sourced from com.amazonaws:aws-java-sdk-s3's changelog
   (https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).
   
   1.12.521 2023-08-02
   
   AWS Budgets
   Features
   - As part of CAE tagging integration we need to update our budget names 
     regex filter to prevent customers from using "/action/" in their 
     budget names.
   
   AWS Glue
   Features
   - This release includes additional Glue Streaming Kafka SASL property 
     types.
   
   AWS Resilience Hub
   Features
   - Drift Detection capability added when applications policy has moved from 
     a meet to breach state. Customers will be able to exclude operational 
     recommendations and receive credit in their resilience score. Customers 
     can now add ARH permissions to an existing or new role.
   
   Amazon Cognito Identity Provider
   Features
   - New feature that logs Cognito user pool error messages to CloudWatch 
     logs.
   
   Amazon SageMaker Service
   Features
   - SageMaker Inference Recommender introduces a new API 
     GetScalingConfigurationRecommendation to recommend auto scaling policies 
     based on completed Inference Recommender jobs.
   
   Commits
   - fbc6752 AWS SDK for Java 1.12.521
     https://github.com/aws/aws-sdk-java/commit/fbc67528f34087e1bace91d31cd5777153769779
   - a040ae2 Update GitHub version number to 1.12.521-SNAPSHOT
     https://github.com/aws/aws-sdk-java/commit/a040ae230464a6fcf78dc0a023726a7dee0205a0
   - See full diff in compare view:
     https://github.com/aws/aws-sdk-java/compare/1.12.520...1.12.521
   
   Updates `com.amazonaws:aws-java-sdk-transcribe` from 1.12.520 to 1.12.521
   
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and 

[GitHub] [tika] dependabot[bot] opened a new pull request, #1269: Bump org.netpreserve:jwarc from 0.28.0 to 0.28.1

2023-08-02 Thread via GitHub


dependabot[bot] opened a new pull request, #1269:
URL: https://github.com/apache/tika/pull/1269

   Bumps [org.netpreserve:jwarc](https://github.com/iipc/jwarc) from 0.28.0 to 
0.28.1.
   
   Release notes
   Sourced from org.netpreserve:jwarc's releases
   (https://github.com/iipc/jwarc/releases).
   
   v0.28.1
   Bugs fixed:
   - Fixed output truncation with the CDX CLI tool due to the 
     OutputStreamWriter buffer not being flushed or closed before exit
   - CdxWriter.process(files, useAbsolutePaths) ignored 
     useAbsolutePaths=false and always output absolute paths
   - CdxRequestEncoder: Improved pywb compatibility for non-ASCII characters 
     in URL-encoded request bodies
   - CdxRequestEncoder: Fixed URLDecoder exception for large request bodies 
     or those including invalid percent encoding
   - WarcWriter.fetch: Fixed bug where the maxTime limit accidentally used the 
     value of the maxLength option instead
   
   Commits
   - 7088551 Release 0.28.1
     https://github.com/iipc/jwarc/commit/7088551c3be94e02b9625a2c1f702eaf1c5dcf83
   - 32b3090 CdxWriter.process(): Fix useAbsolutePaths=false being ignored
     https://github.com/iipc/jwarc/commit/32b3090b5bde3a2b04a124525e7b6ace843f09df
   - fc88458 CdxTool: Fix truncated output by closing CdxWriter
     https://github.com/iipc/jwarc/commit/fc88458dcfd2ae9c8098a13345db52754ae9c095
   - d47a479 CdxRequestEncoder: Handle non-ASCII characters in form request 
     body the same ...
     https://github.com/iipc/jwarc/commit/d47a479c9025ee8fa9fe7e494b33df5b48d802c7
   - ed4e794 CdxRequestEncoder: Fix exception if request body contains 
     partial % encoding
     https://github.com/iipc/jwarc/commit/ed4e7940b8d4b01e85bc80a3a2bedd0cd1319c86
   - e882585 FetchTool: Write truncated response before exiting if stopped by 
     Ctrl-C
     https://github.com/iipc/jwarc/commit/e88258583e318126a2100c691bfbcf24fc7b9a97
   - dd25a62 WarcWriter: Fix typo in maxTime option
     https://github.com/iipc/jwarc/commit/dd25a62fb2b1f0a4336b24bee47e6805136782b0
   - See full diff in compare view:
     https://github.com/iipc/jwarc/compare/v0.28.0...v0.28.1
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [tika] THausherr merged pull request #1268: Bump aws.version from 1.12.520 to 1.12.521

2023-08-02 Thread via GitHub


THausherr merged PR #1268:
URL: https://github.com/apache/tika/pull/1268





[GitHub] [tika] THausherr merged pull request #1269: Bump org.netpreserve:jwarc from 0.28.0 to 0.28.1

2023-08-02 Thread via GitHub


THausherr merged PR #1269:
URL: https://github.com/apache/tika/pull/1269

