[jira] [Resolved] (TIKA-3993) Improve throttle logic in S3Fetcher
[ https://issues.apache.org/jira/browse/TIKA-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-3993.
---
Fix Version/s: 2.7.1
Resolution: Fixed

> Improve throttle logic in S3Fetcher
> ---
>
> Key: TIKA-3993
> URL: https://issues.apache.org/jira/browse/TIKA-3993
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.7.1
>
> We currently have "tries" and sleep ms amounts. We should allow users to set
> an array of seconds for backoff to tune their own logarithmic backoff or
> other strategies.
>
> Further, we're retrying all aws exceptions. We should not
> retry AccessDenied, NoSuchKey etc.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3993) Improve throttle logic in S3Fetcher
[ https://issues.apache.org/jira/browse/TIKA-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17706554#comment-17706554 ]

ASF GitHub Bot commented on TIKA-3993:
---

tballison merged PR #1048:
URL: https://github.com/apache/tika/pull/1048
[GitHub] [tika] tballison merged pull request #1048: TIKA-3993 -- improve S3Fetcher backoff configurability
tballison merged PR #1048:
URL: https://github.com/apache/tika/pull/1048

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
next release?
All,

PDFBox will likely kick off a release in the next week or so. Any objections to running a Tika release shortly thereafter? I'd like to finish a few small things by then. Anyone have any blockers?

I feel that we have enough for a 2.8.0 release instead of a 2.7.1 release. What do you think?

Best,
Tim
[jira] [Commented] (TIKA-3993) Improve throttle logic in S3Fetcher
[ https://issues.apache.org/jira/browse/TIKA-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17706511#comment-17706511 ]

ASF GitHub Bot commented on TIKA-3993:
---

tballison opened a new pull request, #1048:
URL: https://github.com/apache/tika/pull/1048

Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! Your help is appreciated!

Before opening the pull request, please verify that

* there is an open issue on the [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
* the issue ID (`TIKA-`)
  - is referenced in the title of the pull request
  - and placed in front of your commit messages surrounded by square brackets (`[TIKA-] Issue or pull request title`)
* commits are squashed into a single one (or a few commits for larger changes)
* Tika is successfully built and unit tests pass by running `mvn clean test`
* there are no conflicts when merging the pull request branch into the *recent* `main` branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled `main` branch
* if you add a new module that downstream users will depend upon, add it to the relevant group in `tika-bom/pom.xml`.

We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Tika in general, please sign up for the [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks!
[GitHub] [tika] tballison opened a new pull request, #1048: TIKA-3993 -- improve S3Fetcher backoff configurability
tballison opened a new pull request, #1048:
URL: https://github.com/apache/tika/pull/1048
[jira] [Created] (TIKA-3993) Improve throttle logic in S3Fetcher
Tim Allison created TIKA-3993:
---

Summary: Improve throttle logic in S3Fetcher
Key: TIKA-3993
URL: https://issues.apache.org/jira/browse/TIKA-3993
Project: Tika
Issue Type: Task
Reporter: Tim Allison

We currently have "tries" and sleep ms amounts. We should allow users to set an array of seconds for backoff to tune their own logarithmic backoff or other strategies.

Further, we're retrying all aws exceptions. We should not retry AccessDenied, NoSuchKey etc.
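As a rough illustration of the two changes requested in this issue (a user-supplied array of backoff seconds, and no retries for errors such as AccessDenied or NoSuchKey), a minimal Java sketch might look like the following. This is not the code merged for the issue. `BackoffSketch`, `FetchException`, `NO_RETRY`, and `fetchWithBackoff` are hypothetical names, and `FetchException` merely stands in for whatever exception type carries the S3 error code.

```java
import java.util.Set;
import java.util.concurrent.Callable;

/** Hypothetical sketch only; not Tika's actual S3Fetcher code. */
class BackoffSketch {

    /** Stand-in (assumption) for an SDK exception that exposes an S3 error code. */
    static class FetchException extends Exception {
        final String errorCode;

        FetchException(String errorCode) {
            super(errorCode);
            this.errorCode = errorCode;
        }
    }

    /** Error codes a retry cannot fix; these are the two named in the issue. */
    static final Set<String> NO_RETRY = Set.of("AccessDenied", "NoSuchKey");

    /**
     * Runs the task and, after the i-th failure, sleeps backoffSeconds[i]
     * before trying again. An array of length n allows n retries,
     * so at most n + 1 attempts in total.
     */
    static <T> T fetchWithBackoff(Callable<T> task, long[] backoffSeconds) throws Exception {
        int failures = 0;
        while (true) {
            try {
                return task.call();
            } catch (FetchException e) {
                // Give up at once on non-retryable codes or when the array is used up.
                if (NO_RETRY.contains(e.errorCode) || failures >= backoffSeconds.length) {
                    throw e;
                }
                Thread.sleep(backoffSeconds[failures] * 1000L);
                failures++;
            }
        }
    }
}
```

Under these assumptions, passing `new long[]{1, 5, 30}` would allow up to four attempts, sleeping 1, 5, and 30 seconds between them, while an AccessDenied failure would surface after the first attempt.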
[jira] [Commented] (TIKA-3992) Add common missing mimes based on Common Crawl data
[ https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17706405#comment-17706405 ]

Tim Allison commented on TIKA-3992:
---

Ah, that's helpful. Thank you!

By "truncated", I was referring to the feature of CC where they truncate fetches at 1MB. So, we really don't have access to the ends of the files unless we refetch from the original URLs, which I am not proposing doing on this ticket. We'll see what we can do with what we have...

> Add common missing mimes based on Common Crawl data
> ---
>
> Key: TIKA-3992
> URL: https://issues.apache.org/jira/browse/TIKA-3992
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> In the latest Common Crawl crawl, there are ~600k 'octet-stream' files as
> detected by Tika. It would be useful to extract those (even if truncated)
> and run 'file' and 'siegfried' against those file types that are unknown to
> Tika. We can prioritize the most common file formats as identified by file
> and siegfried for addition to our mime-types.xml.
>
> Separately, we might also want to do the same thing for
> `application/zip`...there are likely zip-based file types that we could do a
> better job on.
>
> Thanks to [~snagel] for a dump of stats on the most recent crawl.
[jira] [Commented] (TIKA-3992) Add common missing mimes based on Common Crawl data
[ https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17706400#comment-17706400 ]

Andrew Jackson commented on TIKA-3992:
---

Sounds interesting! Just wanted to note that Siegfried (and DROID/etc) signatures often require end-of-file matches as well as beginning-of-file, so if you do truncate the files you'll get the best results by chopping out the middle. I'd imagine the first and last few KB should do it.
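The "chop out the middle" suggestion in this comment could be sketched roughly as below, assuming the whole file is available as a byte array. `HeadTailSketch` and `headAndTail` are hypothetical names, and the number of bytes to keep is a free parameter that the comment only estimates as a few KB.

```java
/** Hypothetical sketch only; not part of Tika or Siegfried. */
class HeadTailSketch {

    /**
     * Returns the first and last `keep` bytes of `data` with the middle removed.
     * Inputs no longer than 2 * keep bytes are returned whole (as a copy).
     */
    static byte[] headAndTail(byte[] data, int keep) {
        if (keep < 0) {
            throw new IllegalArgumentException("keep must be non-negative");
        }
        if (data.length <= 2L * keep) {
            return data.clone();
        }
        byte[] out = new byte[2 * keep];
        // Head: bytes that beginning-of-file signatures would inspect.
        System.arraycopy(data, 0, out, 0, keep);
        // Tail: bytes that end-of-file signatures would inspect.
        System.arraycopy(data, data.length - keep, out, keep, keep);
        return out;
    }
}
```

For example, `headAndTail(bytes, 4096)` would keep the first and last 4 KB of a file. Whether a given identification tool behaves well on such a spliced sample is something to verify rather than assume.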
[jira] [Created] (TIKA-3992) Add common missing mimes based on Common Crawl data
Tim Allison created TIKA-3992:
---

Summary: Add common missing mimes based on Common Crawl data
Key: TIKA-3992
URL: https://issues.apache.org/jira/browse/TIKA-3992
Project: Tika
Issue Type: Task
Reporter: Tim Allison

In the latest Common Crawl crawl, there are ~600k 'octet-stream' files as detected by Tika. It would be useful to extract those (even if truncated) and run 'file' and 'siegfried' against those file types that are unknown to Tika. We can prioritize the most common file formats as identified by file and siegfried for addition to our mime-types.xml.

Separately, we might also want to do the same thing for `application/zip`...there are likely zip-based file types that we could do a better job on.

Thanks to [~snagel] for a dump of stats on the most recent crawl.
A Message from the Board to PMC members
Dear Apache Project Management Committee (PMC) members,

The Board wants to take just a moment of your time to communicate a few things that seem to have been forgotten by a number of PMC members, across the Foundation, over the past few years. Please note that this is being sent to all projects - yours has not been singled out.

The Project Management Committee (PMC) as a whole[1] is tasked with the oversight, health, and sustainability of the project. The PMC members are responsible collectively, and individually, for ensuring that the project operates in a way that is in line with ASF philosophy, and in a way that serves the developers and users of the project.

The PMC Chair is not the project leader, in any sense. It is the person who files board reports and makes sure they are delivered on time. It is the secretary for the project, and the project’s ambassador to the Board of Directors. The VP title is given as an artifact of US corporate law, and not because the PMC Chair has any special powers. If you are treating your PMC Chair as the project lead, or granting them any other special powers or privileges, you need to be aware that that’s not the intent of the Chair role. The Chair is a PMC member peer with a few extra duties.

Every PMC member has an equal voice in deliberations. Each has one vote. Each has veto power. Every vote weighs the same. It is not only your right, but it is your obligation, to use that vote for the good of the project and its users, not to appease the Chair, your employer, or any other voice in the project.

Every PMC member can, and should, nominate new committers, and new PMC members. This is not the sole domain of the PMC Chair. This might be your most important responsibility to the project, as succession planning is the path to sustainability.

Every PMC member can, and should, respond when the Board sends email to your private list. You should not wait for the PMC Chair to respond. The Board views the entire PMC as responsible for the project, not just one member.

Every PMC member should be subscribed to the private@ mailing list. If you are not, then you are neglecting your duty of oversight. If you no longer wish to be responsible for oversight of the project, you should resign your PMC seat, not merely drop off of the private@ list and ignore it.

You can determine which PMC members are not subscribed to your private list by looking at your PMC roster at https://whimsy.apache.org/roster/committee/ Names with an asterisk (*) next to them are not subscribed to the list. We encourage you to take a moment to contact them with this information.

Thank you for your attention to these matters, and thank you for keeping our projects healthy.

Rich, for The Board of Directors

[1] https://apache.org/foundation/how-it-works.html#pmc-members
[GitHub] [tika] THausherr merged pull request #1047: Bump aws.version from 1.12.436 to 1.12.437
THausherr merged PR #1047:
URL: https://github.com/apache/tika/pull/1047
[GitHub] [tika] dependabot[bot] opened a new pull request, #1047: Bump aws.version from 1.12.436 to 1.12.437
dependabot[bot] opened a new pull request, #1047:
URL: https://github.com/apache/tika/pull/1047

Bumps `aws.version` from 1.12.436 to 1.12.437.

Updates `aws-java-sdk-s3` from 1.12.436 to 1.12.437

Changelog
Sourced from aws-java-sdk-s3's changelog (https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).

1.12.437 2023-03-28

AWS IoT Data Plane
  Features
  - Add endpoint ruleset support for cn-north-1.

AWS Systems Manager Incident Manager
  Features
  - Increased maximum length of TriggerDetails.rawData to 10K characters and IncidentSummary to 8K characters.

AWS Systems Manager Incident Manager Contacts
  Features
  - This release adds 12 new APIs as part of Oncall Schedule feature release, adds support for a new contact type: ONCALL_SCHEDULE. Check public documentation for AWS ssm-contacts for more information

Commits
- 1272d92 AWS SDK for Java 1.12.437 (https://github.com/aws/aws-sdk-java/commit/1272d92d9303bee41c917b303c292523b86e61a6)
- 575cced Update GitHub version number to 1.12.437-SNAPSHOT (https://github.com/aws/aws-sdk-java/commit/575ccedbdbff28c103a526900767a613a7c394ab)
- See full diff in compare view (https://github.com/aws/aws-sdk-java/compare/1.12.436...1.12.437)

Updates `aws-java-sdk-transcribe` from 1.12.436 to 1.12.437