Re: [PR] Bump org.apache.maven.plugin-tools:maven-plugin-annotations from 3.11.0 to 3.12.0 [tika]
THausherr merged PR #1707: URL: https://github.com/apache/tika/pull/1707 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Bump aws.version from 1.12.693 to 1.12.694 [tika]
THausherr merged PR #1706: URL: https://github.com/apache/tika/pull/1706
[PR] Bump org.apache.maven.plugin-tools:maven-plugin-annotations from 3.11.0 to 3.12.0 [tika]
dependabot[bot] opened a new pull request, #1707:
URL: https://github.com/apache/tika/pull/1707

Bumps [org.apache.maven.plugin-tools:maven-plugin-annotations](https://github.com/apache/maven-plugin-tools) from 3.11.0 to 3.12.0.

Release notes
Sourced from [org.apache.maven.plugin-tools:maven-plugin-annotations's releases](https://github.com/apache/maven-plugin-tools/releases).

3.12.0
[Release Notes - Maven Plugin Tools - Version 3.12.0](https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12317820&version=12354176)

Commits
- [0b69acc](https://github.com/apache/maven-plugin-tools/commit/0b69acc43ead6efb336c78895e59749f463cc899) [maven-release-plugin] prepare release maven-plugin-tools-3.12.0
- [e5e3dd2](https://github.com/apache/maven-plugin-tools/commit/e5e3dd235d572956d49ea461346466d54aec7195) code simplifications
- [e97ba77](https://github.com/apache/maven-plugin-tools/commit/e97ba772405c20849f5e47aa9b4971ddcbc9826b) [MPLUGIN-510] group history per common requirements
- [6f9c3d9](https://github.com/apache/maven-plugin-tools/commit/6f9c3d9371155f6a826f8052e099adb716163689) use @Component instead of @Parameter when possible
- [d8fecbc](https://github.com/apache/maven-plugin-tools/commit/d8fecbc1f8b83e9f38b20bd41bac907a9fa0bef4) Bump org.codehaus.plexus:plexus-archiver from 4.9.1 to 4.9.2
- [a9dd57d](https://github.com/apache/maven-plugin-tools/commit/a9dd57dbdd83e3bb295b9c1bf39c128b4eefc0d3) rename mavenVersion to maven3Version
- [1aad214](https://github.com/apache/maven-plugin-tools/commit/1aad21414b8ad92a880d22ffb2b935b3fefa85d8) [MPLUGIN-514] switch from png+imagemap to svg
- [ddbaa5b](https://github.com/apache/maven-plugin-tools/commit/ddbaa5b46aff5581af5b16ab6ffec2ba3c466705) Bump apache/maven-gh-actions-shared from 3 to 4
- [cd74761](https://github.com/apache/maven-plugin-tools/commit/cd747611b768031b57fda87bb8e19845d2dc69fa) [MPLUGIN-511] add versions history requirements detection
- [d9f8d89](https://github.com/apache/maven-plugin-tools/commit/d9f8d8941d6996ad39b6f4c427f8e126c1176154) [MPLUGIN-511] prepare method to list releases history
- Additional commits viewable in [compare view](https://github.com/apache/maven-plugin-tools/compare/maven-plugin-tools-3.11.0...maven-plugin-tools-3.12.0)

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.apache.maven.plugin-tools:maven-plugin-annotations=maven=3.11.0=3.12.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

---

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
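
As a rough illustration of the major/minor distinction that the ignore commands above rely on, the following sketch classifies a bump between two version strings. The function name `classify_bump` is hypothetical, and this is not how Dependabot itself decides anything; it simply assumes purely numeric, dot-separated versions like the ones in these PR titles.

```python
def classify_bump(old: str, new: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for two dotted numeric versions.

    Assumption: every component is a plain integer (no qualifiers such as
    '-SNAPSHOT' or 'rc1'); anything else raises ValueError.
    """
    old_parts = [int(p) for p in old.split(".")]
    new_parts = [int(p) for p in new.split(".")]

    # Pad the shorter version with zeros so "3.12" compares like "3.12.0".
    width = max(len(old_parts), len(new_parts), 3)
    old_parts += [0] * (width - len(old_parts))
    new_parts += [0] * (width - len(new_parts))

    # The first differing component decides the kind of bump.
    for label, a, b in zip(("major", "minor", "patch"), old_parts, new_parts):
        if a != b:
            return label

    # A difference beyond the third component is treated as a patch-level change.
    return "patch" if old_parts != new_parts else "none"


if __name__ == "__main__":
    print(classify_bump("3.11.0", "3.12.0"))
    print(classify_bump("1.12.693", "1.12.694"))
```

Under these assumptions, the maven-plugin-annotations update in this PR is a minor bump, while the aws.version updates elsewhere in this thread are patch bumps.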
[PR] Bump aws.version from 1.12.693 to 1.12.694 [tika]
dependabot[bot] opened a new pull request, #1706:
URL: https://github.com/apache/tika/pull/1706

Bumps `aws.version` from 1.12.693 to 1.12.694.

Updates `com.amazonaws:aws-java-sdk-s3` from 1.12.693 to 1.12.694

Changelog
Sourced from [com.amazonaws:aws-java-sdk-s3's changelog](https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).

1.12.694 2024-04-03

AWS Clean Rooms ML
- Features
  - The release includes a public SDK for AWS Clean Rooms ML APIs, making them globally available to developers worldwide.

AWS CloudFormation
- Features
  - This release would return a new field - PolicyAction in cloudformation's existed DescribeChangeSetResponse, showing actions we are going to apply on the physical resource (e.g., Delete, Retain) according to the user's template

AWS Elemental MediaLive
- Features
  - Cmaf Ingest outputs are now supported in Media Live

AWS Ground Station
- Features
  - This release adds visibilityStartTime and visibilityEndTime to DescribeContact and ListContacts responses.

AWS Health Imaging
- Features
  - SearchImageSets API now supports following enhancements
    - Additional support for searching on UpdatedAt and SeriesInstanceUID
    - Support for searching existing filters between dates/times
    - Support for sorting the search result by Ascending/Descending
    - Additional parameters returned in the response

AWS Lambda
- Features
  - Add Ruby 3.3 (ruby3.3) support to AWS Lambda

AWS Transfer Family
- Features
  - Add ability to specify Security Policies for SFTP Connectors

Amazon DataZone
- Features
  - This release supports the feature of dataQuality to enrich asset with dataQualityResult in Amazon DataZone.

Amazon DocumentDB with MongoDB compatibility
- Features
  - This release adds Global Cluster Switchover capability which enables you to change your global cluster's primary AWS Region, the region that serves writes, while preserving the replication between all regions in the global cluster.

Commits
- [4a46bf3](https://github.com/aws/aws-sdk-java/commit/4a46bf329f920857481ff49ce831da7110691589) AWS SDK for Java 1.12.694
- [5de4081](https://github.com/aws/aws-sdk-java/commit/5de4081bbbf8abdbc07945a3c06968966724e825) Update GitHub version number to 1.12.694-SNAPSHOT
- See full diff in [compare view](https://github.com/aws/aws-sdk-java/compare/1.12.693...1.12.694)

Updates `com.amazonaws:aws-java-sdk-transcribe` from 1.12.693 to 1.12.694

Changelog
Sourced from [com.amazonaws:aws-java-sdk-transcribe's changelog](https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).

1.12.694 2024-04-03

AWS Clean Rooms ML
- Features
  - The release includes a public SDK for AWS Clean Rooms ML APIs, making them globally available to developers worldwide.

AWS CloudFormation
- Features
  - This release would return a new field - PolicyAction in cloudformation's existed DescribeChangeSetResponse, showing actions we are going to apply on the physical resource (e.g., Delete, Retain) according to the user's template

AWS Elemental MediaLive
- Features
  - Cmaf Ingest outputs are now supported in Media Live

AWS Ground Station
- Features
  - This release adds visibilityStartTime and visibilityEndTime to DescribeContact and ListContacts responses.

AWS Health Imaging
- Features
  - SearchImageSets API now supports following enhancements
    - Additional support for searching on UpdatedAt and SeriesInstanceUID
    - Support for searching existing filters between dates/times
    - Support for sorting the search result by Ascending/Descending
    - Additional parameters returned in the response

AWS Lambda
- Features
  - Add Ruby 3.3 (ruby3.3) support to AWS Lambda

AWS Transfer Family
- Features
  - Add ability to specify Security Policies for SFTP Connectors

Amazon DataZone
- Features
  - This release supports the feature of dataQuality to enrich asset with dataQualityResult in Amazon DataZone.

Amazon DocumentDB with MongoDB compatibility
- Features
  - This release adds Global Cluster Switchover capability which enables you to change your global cluster's primary AWS Region, the region that serves writes, while preserving the replication between all regions in the global cluster.

Commits
- [4a46bf3](https://github.com/aws/aws-sdk-java/commit/4a46bf329f920857481ff49ce831da7110691589) AWS SDK for Java 1.12.694
- [5de4081](https://github.com/aws/aws-sdk-java/commit/5de4081bbbf8abdbc07945a3c06968966724e825) Update GitHub version number to
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833807#comment-17833807 ]

Tilman Hausherr commented on TIKA-4231:
---
Yes it is text, but the PDF is using a feature that we don't support. Instead of having its own unicode for each glyph, it has the text extraction on a separate level.

> Parsing Arabic PDF is returning bad data
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18. And using maven dependency tika-parsers-standard-package ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)]
> Reporter: Aamir
> Priority: Major
> Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
> Attached is a PDF with arabic text in it.
> When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish characters.
> The generated text doc is also attached which contains the parsed text.
> Most of the other Arabic PDFs parse fine, but this one is giving this output.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
OCR dataset
Hi devs, I saw this dataset on Hugging Face, seems useful for evaluating Tika OCR… — Ken https://huggingface.co/datasets/pixparse/idl-wds
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833745#comment-17833745 ] Tim Allison commented on TIKA-4231: --- On some PDFs, there can be problems with Unicode mappings and other glyph issues. For some of these files, they render well but the underlying electronic text is junk. In those cases, OCR is the best option. I haven’t looked at this pdf and don’t know if the above is the case for this one.
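
One way to act on the point above is to screen the extracted text before deciding whether OCR is worth trying. The sketch below is a hypothetical heuristic and not part of Tika or PDFBox: it measures what share of the letters in a string belong to the Arabic script, on the assumption that text extracted from an Arabic document through a broken Unicode mapping tends to come out as unrelated symbols or letters from other scripts. The names `arabic_letter_ratio` and `looks_like_junk` and the 0.5 threshold are illustrative assumptions, not tuned values.

```python
import unicodedata


def arabic_letter_ratio(text: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with 'ARABIC'.

    Returns 0.0 when the text contains no alphabetic characters at all
    (for example, only private-use code points, symbols, or whitespace).
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("ARABIC")
    )
    return arabic / len(letters)


def looks_like_junk(text: str, min_ratio: float = 0.5) -> bool:
    """Flag text as suspicious when too few of its letters are Arabic.

    Assumption: the caller already expects the document to be mostly Arabic.
    The threshold is arbitrary and would need tuning on real files.
    """
    return arabic_letter_ratio(text) < min_ratio


if __name__ == "__main__":
    good = "مرحبا بالعالم"
    bad = "ÿþ£¤ qwe zx"
    for sample in (good, bad):
        print(repr(sample), looks_like_junk(sample))
```

A check like this only says that the text layer looks wrong for the expected language; it cannot tell why, so a flagged file would still need to be inspected or sent through OCR.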
[jira] [Comment Edited] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833745#comment-17833745 ] Tim Allison edited comment on TIKA-4231 at 4/3/24 9:18 PM: --- On some PDFs, there can be problems with Unicode mappings and other glyph/font issues. For some of these files, they render well but the underlying electronic text is junk. In those cases, OCR is the best option. I haven’t looked at this pdf and don’t know if the above is the case for this one. was (Author: talli...@mitre.org): On some PDFs, there can be problems with Unicode mappings and other glyph issues. For some of these files, they render well but the underlying electronic text is junk. In those cases, OCR is the best option. I haven’t looked at this pdf and don’t know if the above is the case for this one.
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833740#comment-17833740 ] Aamir commented on TIKA-4231: - Why use OCR? This is text, not images.
Community over Code EU 2024: Start planning your trip!
[Note: You're receiving this email because you are subscribed to one or more project dev@ mailing lists at the Apache Software Foundation.]

Dear community,

We hope you are doing great. Are you ready for Community Over Code EU? Check out the featured sessions, get your tickets with special discounts, and start planning your trip.

Save your spot! Take a look at our lineup of sessions, panelists and featured speakers and make your final choice:

* EU policies and regulations affecting open source specialists working in OSPOs
  The panel will discuss how EU legislation affects the daily work of open source operations. Panelists will cover some recent policy updates, the challenges of staying compliant when managing open source contribution and usage within organizations, and their personal experiences in adapting to the changing European regulatory environment.

* Doing for sustainability, what open source did for software
  In this keynote, Asim Hussain will explain the history of Impact Framework, a coalition of hundreds of software practitioners with tangible solutions that directly foster meaningful change by measuring the environmental impacts of a piece of software.

Don’t forget that we have special discounts for groups, students and Apache committers. Visit the website to discover more about these rates. [1]

It's time for you to start planning your trip. Remember that we have prepared a “How to get there” guide to help you find the best way to reach Bratislava, whether by train, bus, flight or boat, from wherever you are coming. Take a look at the different options and please reach out to us if you have any questions.

We have rooms available at a special rate at the Radisson Blu Carlton Hotel, where the event will take place, and at the Park Inn Hotel, which is only a 5-minute walk from the venue. [2] However, you are free to choose any other accommodation around the city.

See you in Bratislava,
Community Over Code EU Team

[1]: https://eu.communityovercode.org/tickets/ "Register"
[2]: https://eu.communityovercode.org/venue/ "Where to stay"
Participate in the ASF 25th Anniversary Campaign
Hi everyone,

As part of The ASF’s 25th anniversary campaign[1], we will be celebrating projects and communities in multiple ways. We invite all projects and contributors to participate in the following ways:

* Individuals - submit your first contribution: https://news.apache.org/foundation/entry/the-asf-launches-firstasfcontribution-campaign
* Projects - share your public good story: https://docs.google.com/forms/d/1vuN-tUnBwpTgOE5xj3Z5AG1hsOoDNLBmGIqQHwQT6k8/viewform?edit_requested=true
* Projects - submit a project spotlight for the blog: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278466116
* Projects - contact the Voice of Apache podcast (formerly Feathercast) to be featured: https://feathercast.apache.org/help/
* Projects - use the 25th anniversary template and the #ASF25Years hashtag on social media: https://docs.google.com/presentation/d/1oDbMol3F_XQuCmttPYxBIOIjRuRBksUjDApjd8Ve3L8/edit#slide=id.g26b0919956e_0_13

If you have questions, email the Marketing & Publicity team at mark...@apache.org.

Peace,
BKP

[1] https://apache.org/asf25years/

[NOTE: You are receiving this message because you are a contributor to an Apache Software Foundation project. The ASF will very occasionally send out messages relating to the Foundation to contributors and members, such as this one.]

Brian Proffitt
VP, Marketing & Publicity
VP, Conferences
Re: [PR] Bump aws.version from 1.12.692 to 1.12.693 [tika]
THausherr merged PR #1705: URL: https://github.com/apache/tika/pull/1705
[PR] Bump aws.version from 1.12.692 to 1.12.693 [tika]
dependabot[bot] opened a new pull request, #1705:
URL: https://github.com/apache/tika/pull/1705

Bumps `aws.version` from 1.12.692 to 1.12.693.

Updates `com.amazonaws:aws-java-sdk-s3` from 1.12.692 to 1.12.693

Changelog
Sourced from [com.amazonaws:aws-java-sdk-s3's changelog](https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).

1.12.693 2024-04-02

AWS Glue
- Features
  - Adding View related fields to responses of read-only Table APIs.

AWS SecurityHub
- Features
  - Documentation updates for AWS Security Hub

Amazon EC2 Container Service
- Features
  - Documentation only update for Amazon ECS.

Amazon Interactive Video Service Chat
- Features
  - Doc-only update. Changed Resources to Key Concepts in docs and updated text.

IAM Roles Anywhere
- Features
  - This release increases the limit on the roleArns request parameter for the *Profile APIs that support it. This parameter can now take up to 250 role ARNs.

Commits
- [c2aacfe](https://github.com/aws/aws-sdk-java/commit/c2aacfe55a561d4b5153fcee111f4f9c21eb7e5d) AWS SDK for Java 1.12.693
- [f740c21](https://github.com/aws/aws-sdk-java/commit/f740c212ad0f22621fa3fcca59aa02d35b67de5f) Update GitHub version number to 1.12.693-SNAPSHOT
- See full diff in [compare view](https://github.com/aws/aws-sdk-java/compare/1.12.692...1.12.693)

Updates `com.amazonaws:aws-java-sdk-transcribe` from 1.12.692 to 1.12.693

Changelog
Sourced from [com.amazonaws:aws-java-sdk-transcribe's changelog](https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).

1.12.693 2024-04-02

AWS Glue
- Features
  - Adding View related fields to responses of read-only Table APIs.

AWS SecurityHub
- Features
  - Documentation updates for AWS Security Hub

Amazon EC2 Container Service
- Features
  - Documentation only update for Amazon ECS.

Amazon Interactive Video Service Chat
- Features
  - Doc-only update. Changed Resources to Key Concepts in docs and updated text.

IAM Roles Anywhere
- Features
  - This release increases the limit on the roleArns request parameter for the *Profile APIs that support it. This parameter can now take up to 250 role ARNs.

Commits
- [c2aacfe](https://github.com/aws/aws-sdk-java/commit/c2aacfe55a561d4b5153fcee111f4f9c21eb7e5d) AWS SDK for Java 1.12.693
- [f740c21](https://github.com/aws/aws-sdk-java/commit/f740c212ad0f22621fa3fcca59aa02d35b67de5f) Update GitHub version number to 1.12.693-SNAPSHOT
- See full diff in [compare view](https://github.com/aws/aws-sdk-java/compare/1.12.692...1.12.693)