[jira] [Resolved] (TIKA-3993) Improve throttle logic in S3Fetcher

2023-03-29 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3993.
---
Fix Version/s: 2.7.1
   Resolution: Fixed

> Improve throttle logic in S3Fetcher
> ---
>
>        Key: TIKA-3993
>        URL: https://issues.apache.org/jira/browse/TIKA-3993
>    Project: Tika
> Issue Type: Task
>   Reporter: Tim Allison
>   Priority: Minor
>    Fix For: 2.7.1
>
>
> We currently have a "tries" count and a sleep amount in ms.  We should let
> users set an array of backoff delays in seconds so they can tune their own
> logarithmic backoff or other strategies.  Further, we're retrying all AWS
> exceptions.  We should not retry AccessDenied, NoSuchKey, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3993) Improve throttle logic in S3Fetcher

2023-03-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17706554#comment-17706554
 ] 

ASF GitHub Bot commented on TIKA-3993:
--

tballison merged PR #1048:
URL: https://github.com/apache/tika/pull/1048






[GitHub] [tika] tballison merged pull request #1048: TIKA-3993 -- improve S3Fetcher backoff configurability

2023-03-29 Thread via GitHub


tballison merged PR #1048:
URL: https://github.com/apache/tika/pull/1048


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



next release?

2023-03-29 Thread Tim Allison
All,

  PDFBox will likely kick off a release in the next week or so.  Any
objections to running a Tika release shortly thereafter?  I'd like to
finish a few small things by then.  Anyone have any blockers?

  I feel that we have enough for a 2.8.0 release instead of a 2.7.1 release.

  What do you think?

  Best,

   Tim


[jira] [Commented] (TIKA-3993) Improve throttle logic in S3Fetcher

2023-03-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17706511#comment-17706511
 ] 

ASF GitHub Bot commented on TIKA-3993:
--

tballison opened a new pull request, #1048:
URL: https://github.com/apache/tika/pull/1048

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)!
   Your help is appreciated!

   Before opening the pull request, please verify that
   * there is an open issue on the
     [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) that
     describes the problem or the improvement. We cannot accept pull requests
     without an issue because the change wouldn't be listed in the release
     notes.
   * the issue ID (`TIKA-`)
     - is referenced in the title of the pull request
     - and placed in front of your commit messages surrounded by square
       brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or a few commits for larger
     changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there are no conflicts when merging the pull request branch into the
     *recent* `main` branch. If there are conflicts, please try to rebase the
     pull request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to
     the relevant group in `tika-bom/pom.xml`.

   We will be able to integrate your pull request faster if these conditions
   are met. If you have any questions about how to fix your problem or about
   using Tika in general, please sign up for the
   [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks!
   






[GitHub] [tika] tballison opened a new pull request, #1048: TIKA-3993 -- improve S3Fetcher backoff configurability

2023-03-29 Thread via GitHub


tballison opened a new pull request, #1048:
URL: https://github.com/apache/tika/pull/1048

   
   
   





[jira] [Created] (TIKA-3993) Improve throttle logic in S3Fetcher

2023-03-29 Thread Tim Allison (Jira)
Tim Allison created TIKA-3993:
-

     Summary: Improve throttle logic in S3Fetcher
         Key: TIKA-3993
         URL: https://issues.apache.org/jira/browse/TIKA-3993
     Project: Tika
  Issue Type: Task
    Reporter: Tim Allison


We currently have a "tries" count and a sleep amount in ms.  We should let
users set an array of backoff delays in seconds so they can tune their own
logarithmic backoff or other strategies.  Further, we're retrying all AWS
exceptions.  We should not retry AccessDenied, NoSuchKey, etc.
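
A rough sketch of the behaviour the issue asks for, not the code that was merged in PR #1048. Everything named here (`BackoffRetrySketch`, `ServiceException`, `fetchWithBackoff`, `sleepMillis`) is hypothetical and stands in for whatever S3Fetcher and the AWS SDK actually use. It assumes the service exception is unchecked and exposes an error code string, and that the user supplies the backoff schedule as an array of seconds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Hypothetical sketch of configurable backoff; not the actual S3Fetcher code.
class BackoffRetrySketch {

    // Stand-in for an AWS service exception carrying an error code (hypothetical type).
    static class ServiceException extends RuntimeException {
        final String errorCode;

        ServiceException(String errorCode) {
            super(errorCode);
            this.errorCode = errorCode;
        }
    }

    // Error codes where a retry cannot succeed; the issue names these two.
    static final Set<String> NO_RETRY =
            new HashSet<>(Arrays.asList("AccessDenied", "NoSuchKey"));

    // Calls fetch once, then retries at most once per entry in sleepSeconds,
    // sleeping for that many seconds before each retry.
    static <T> T fetchWithBackoff(Supplier<T> fetch, long[] sleepSeconds,
                                  LongConsumer sleeper) {
        int retries = 0;
        while (true) {
            try {
                return fetch.get();
            } catch (ServiceException e) {
                if (NO_RETRY.contains(e.errorCode) || retries >= sleepSeconds.length) {
                    throw e; // permanent failure, or the schedule is exhausted
                }
                sleeper.accept(sleepSeconds[retries] * 1000L);
                retries++;
            }
        }
    }

    // A real sleeper: blocks the current thread and preserves the interrupt flag.
    static void sleepMillis(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        List<Long> slept = new ArrayList<>();
        int[] calls = {0};
        // Fails twice with a retryable code, then succeeds; sleeps are recorded, not performed.
        String result = fetchWithBackoff(() -> {
            if (calls[0]++ < 2) {
                throw new ServiceException("SlowDown");
            }
            return "ok";
        }, new long[] {1, 5, 30}, ms -> slept.add(ms));
        System.out.println(result + " after sleeping (ms): " + slept);
    }
}
```

Passing the sleeper as a parameter keeps the schedule logic testable without real delays; production code would pass `BackoffRetrySketch::sleepMillis` or similar.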





[jira] [Commented] (TIKA-3992) Add common missing mimes based on Common Crawl data

2023-03-29 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17706405#comment-17706405
 ] 

Tim Allison commented on TIKA-3992:
---

Ah, that's helpful. Thank you!  By "truncated", I was referring to the feature 
of CC where they truncate fetches at 1MB.  So, we really don't have access to 
the ends of the files unless we refetch from the original URLs, which I am not 
proposing doing on this ticket.

We'll see what we can do with what we have...

> Add common missing mimes based on Common Crawl data
> ---
>
>        Key: TIKA-3992
>        URL: https://issues.apache.org/jira/browse/TIKA-3992
>    Project: Tika
> Issue Type: Task
>   Reporter: Tim Allison
>   Priority: Major
>
> In the latest Common Crawl crawl, there are ~600k 'octet-stream' files as 
> detected by Tika.  It would be useful to extract those (even if truncated) 
> and run 'file' and 'siegfried' against those file types that are unknown to 
> Tika.  We can prioritize the most common file formats as identified by file 
> and siegfried for addition to our mime-types.xml.
> Separately, we might also want to do the same thing for 
> `application/zip`...there are likely zip-based file types that we could do a 
> better job on.
> Thanks to [~snagel] for a dump of stats on the most recent crawl.





[jira] [Commented] (TIKA-3992) Add common missing mimes based on Common Crawl data

2023-03-29 Thread Andrew Jackson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17706400#comment-17706400
 ] 

Andrew Jackson commented on TIKA-3992:
--

Sounds interesting! Just wanted to note that Siegfried (and DROID/etc) 
signatures often require end-of-file matches as well as beginning-of-file, so 
if you do truncate the files you'll get the best results by chopping out the 
middle. I'd imagine the first and last few KB should do it.
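
To illustrate "chopping out the middle", here is a minimal sketch that keeps the first and last N bytes of a file's content. The class and method names are hypothetical, and the sizes to keep are left to the caller; whether a given signature still matches on such a spliced file is an assumption that would need checking against the tool in question.

```java
import java.util.Arrays;

// Hypothetical helper: keep the head and tail of a byte array, drop the middle.
class HeadTailTruncateSketch {

    // Returns the first `head` bytes followed by the last `tail` bytes.
    // Inputs no longer than head + tail are returned as an unchanged copy.
    static byte[] keepHeadAndTail(byte[] data, int head, int tail) {
        if (head < 0 || tail < 0) {
            throw new IllegalArgumentException("head and tail must be non-negative");
        }
        if (data.length <= (long) head + tail) {
            return data.clone();
        }
        byte[] out = new byte[head + tail];
        System.arraycopy(data, 0, out, 0, head);
        System.arraycopy(data, data.length - tail, out, head, tail);
        return out;
    }

    public static void main(String[] args) {
        byte[] data = new byte[10];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        System.out.println(Arrays.toString(keepHeadAndTail(data, 3, 2)));
    }
}
```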



[jira] [Created] (TIKA-3992) Add common missing mimes based on Common Crawl data

2023-03-29 Thread Tim Allison (Jira)
Tim Allison created TIKA-3992:
-

     Summary: Add common missing mimes based on Common Crawl data
         Key: TIKA-3992
         URL: https://issues.apache.org/jira/browse/TIKA-3992
     Project: Tika
  Issue Type: Task
    Reporter: Tim Allison


In the latest Common Crawl crawl, there are ~600k 'octet-stream' files as 
detected by Tika.  It would be useful to extract those (even if truncated) and 
run 'file' and 'siegfried' against those file types that are unknown to Tika.  
We can prioritize the most common file formats as identified by file and 
siegfried for addition to our mime-types.xml.
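
The prioritization step could be as simple as tallying the labels that the external tools report and sorting by count. The sketch below assumes the labels have already been collected into a list of strings, one per file; the class name, method name, and sample labels are made up for illustration and are not real 'file' or 'siegfried' output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical tally of per-file type labels, most frequent first.
class MimeTallySketch {

    // Counts each distinct label and returns entries sorted by descending
    // count, breaking ties alphabetically by label.
    static List<Map.Entry<String, Long>> rankByFrequency(List<String> labels) {
        Map<String, Long> counts = labels.stream()
                .collect(Collectors.groupingBy(label -> label, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up labels, one per file that Tika reported as octet-stream.
        List<String> labels = Arrays.asList(
                "format-a", "format-b", "format-a", "format-c", "format-b", "format-a");
        for (Map.Entry<String, Long> entry : rankByFrequency(labels)) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```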

Separately, we might also want to do the same thing for 
`application/zip`...there are likely zip-based file types that we could do a 
better job on.

Thanks to [~snagel] for a dump of stats on the most recent crawl.





A Message from the Board to PMC members

2023-03-29 Thread Rich Bowen
Dear Apache Project Management Committee (PMC) members,

The Board wants to take just a moment of your time to communicate a few
things that seem to have been forgotten by a number of PMC members,
across the Foundation, over the past few years.  Please note that this
is being sent to all projects - yours has not been singled out.

The Project Management Committee (PMC) as a whole[1] is tasked with the
oversight, health, and sustainability of the project. The PMC members
are responsible collectively, and individually, for ensuring that the
project operates in a way that is in line with ASF philosophy, and in a
way that serves the developers and users of the project.

The PMC Chair is not the project leader, in any sense. It is the person
who files board reports and makes sure they are delivered on time. It
is the secretary for the project, and the project’s ambassador to the
Board of Directors. The VP title is given as an artifact of US
corporate law, and not because the PMC Chair has any special powers. If
you are treating your PMC Chair as the project lead, or granting them
any other special powers or privileges, you need to be aware that
that’s not the intent of the Chair role. The Chair is a PMC member peer
with a few extra duties.

Every PMC member has an equal voice in deliberations. Each has one
vote. Each has veto power. Every vote weighs the same. It is not only
your right, but it is your obligation, to use that vote for the good of
the project and its users, not to appease the Chair, your employer, or
any other voice in the project. 

Every PMC member can, and should, nominate new committers, and new PMC
members. This is not the sole domain of the PMC Chair. This might be
your most important responsibility to the project, as succession
planning is the path to sustainability.

Every PMC member can, and should, respond when the Board sends email to
your private list. You should not wait for the PMC Chair to respond.
The Board views the entire PMC as responsible for the project, not just
one member.

Every PMC member should be subscribed to the private@ mailing list. If
you are not, then you are neglecting your duty of oversight. If you no
longer wish to be responsible for oversight of the project, you should
resign your PMC seat, not merely drop off of the private@ list and
ignore it. You can determine which PMC members are not subscribed to
your private list by looking at your PMC roster at
https://whimsy.apache.org/roster/committee/  Names with an asterisk (*)
next to them are not subscribed to the list. We encourage you to take a
moment to contact them with this information.

Thank you for your attention to these matters, and thank you for
keeping our projects healthy.

Rich, for The Board of Directors

[1] https://apache.org/foundation/how-it-works.html#pmc-members



[GitHub] [tika] THausherr merged pull request #1047: Bump aws.version from 1.12.436 to 1.12.437

2023-03-29 Thread via GitHub


THausherr merged PR #1047:
URL: https://github.com/apache/tika/pull/1047





[GitHub] [tika] dependabot[bot] opened a new pull request, #1047: Bump aws.version from 1.12.436 to 1.12.437

2023-03-29 Thread via GitHub


dependabot[bot] opened a new pull request, #1047:
URL: https://github.com/apache/tika/pull/1047

   Bumps `aws.version` from 1.12.436 to 1.12.437.
   Updates `aws-java-sdk-s3` from 1.12.436 to 1.12.437

   Changelog
   Sourced from aws-java-sdk-s3's changelog:
   https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md

   1.12.437 2023-03-28

   AWS IoT Data Plane
   Features
   - Add endpoint ruleset support for cn-north-1.

   AWS Systems Manager Incident Manager
   Features
   - Increased maximum length of TriggerDetails.rawData to 10K characters and
     IncidentSummary to 8K characters.

   AWS Systems Manager Incident Manager Contacts
   Features
   - This release adds 12 new APIs as part of Oncall Schedule feature release,
     adds support for a new contact type: ONCALL_SCHEDULE. Check public
     documentation for AWS ssm-contacts for more information

   Commits
   - 1272d92 AWS SDK for Java 1.12.437
     https://github.com/aws/aws-sdk-java/commit/1272d92d9303bee41c917b303c292523b86e61a6
   - 575cced Update GitHub version number to 1.12.437-SNAPSHOT
     https://github.com/aws/aws-sdk-java/commit/575ccedbdbff28c103a526900767a613a7c394ab
   - See full diff in compare view:
     https://github.com/aws/aws-sdk-java/compare/1.12.436...1.12.437
   
   
   
   
   Updates `aws-java-sdk-transcribe` from 1.12.436 to 1.12.437
   
   
   
   
   
   
   
   
   

