[GitHub] [tika] dependabot[bot] opened a new pull request, #701: Bump google-cloud-storage from 2.11.3 to 2.12.0
dependabot[bot] opened a new pull request, #701:
URL: https://github.com/apache/tika/pull/701

Bumps [google-cloud-storage](https://github.com/googleapis/java-storage) from 2.11.3 to 2.12.0.

Release notes

Sourced from [google-cloud-storage's releases](https://github.com/googleapis/java-storage/releases).

v2.12.0

[2.12.0](https://github.com/googleapis/java-storage/compare/v2.11.3...v2.12.0) (2022-09-15)

Features

- Add toString method for CustomPlacementConfig ([#1602](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1602)) ([51aca10](https://github.com/googleapis/java-storage/commit/51aca10fafe685ed9e7cb41bc4ae79be10feb080))

Documentation

- Add batch sample ([#1559](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1559)) ([583bf73](https://github.com/googleapis/java-storage/commit/583bf73f5d58aa5d79fbaa12b24407c558235eed))
- Document thread safety of library ([#1566](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1566)) ([c740899](https://github.com/googleapis/java-storage/commit/c7408999e811ba917edb0c136432afa29075e0f2))
- Fix broken links in readme ([#1520](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1520)) ([840b08a](https://github.com/googleapis/java-storage/commit/840b08a03fa7c0535855140244c282f79403b458))

Dependencies

- Update dependency com.google.cloud:google-cloud-shared-dependencies to v3.0.2 ([#1611](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1611)) ([8a48aea](https://github.com/googleapis/java-storage/commit/8a48aea7e0049c64ef944b532a2874115b1e2323))
- Update dependency com.google.cloud:google-cloud-shared-dependencies to v3.0.3 ([#1620](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1620)) ([20e6378](https://github.com/googleapis/java-storage/commit/20e63785462e7876a7ff0ca1363007cc160f))

Changelog

Sourced from [google-cloud-storage's changelog](https://github.com/googleapis/java-storage/blob/main/CHANGELOG.md).

[2.12.0](https://github.com/googleapis/java-storage/compare/v2.11.3...v2.12.0) (2022-09-15)

Features

- Add toString method for CustomPlacementConfig ([#1602](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1602)) ([51aca10](https://github.com/googleapis/java-storage/commit/51aca10fafe685ed9e7cb41bc4ae79be10feb080))

Documentation

- Add batch sample ([#1559](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1559)) ([583bf73](https://github.com/googleapis/java-storage/commit/583bf73f5d58aa5d79fbaa12b24407c558235eed))
- Document thread safety of library ([#1566](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1566)) ([c740899](https://github.com/googleapis/java-storage/commit/c7408999e811ba917edb0c136432afa29075e0f2))
- Fix broken links in readme ([#1520](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1520)) ([840b08a](https://github.com/googleapis/java-storage/commit/840b08a03fa7c0535855140244c282f79403b458))

Dependencies

- Update dependency com.google.cloud:google-cloud-shared-dependencies to v3.0.2 ([#1611](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1611)) ([8a48aea](https://github.com/googleapis/java-storage/commit/8a48aea7e0049c64ef944b532a2874115b1e2323))
- Update dependency com.google.cloud:google-cloud-shared-dependencies to v3.0.3 ([#1620](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1620)) ([20e6378](https://github.com/googleapis/java-storage/commit/20e63785462e7876a7ff0ca1363007cc160f))

Commits

- [932259e](https://github.com/googleapis/java-storage/commit/932259e9a744081b5416c9fb582af519b4360146) chore(main): release 2.12.0 ([#1565](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1565))
- [20e6378](https://github.com/googleapis/java-storage/commit/20e63785462e7876a7ff0ca1363007cc160f) deps: update dependency com.google.cloud:google-cloud-shared-dependencies to ...
- [5915383](https://github.com/googleapis/java-storage/commit/5915383e68cb99d416f2b50b7f924a91b788ad13) test(deps): update dependency com.google.cloud:google-cloud-pubsub to v1.120
- [c4432fd](https://github.com/googleapis/java-storage/commit/c4432fda450b9cb6f03c984e0c0d89e4d71f3c6c) chore(bazel): Update WORKSPACE files for rules_gapic, gax_java, generator_jav...
- [c779dde](https://github.com/googleapis/java-storage/commit/c779dde5724ddc2153be06c6fae72ac4bb325e07) test(deps): update dependency com.google.cloud:google-cloud-pubsub to v1.120
- [3ef792f](https://github.com/googleapis/java-storage/commit/3ef792fd180023ed63e2790554e2cfb772651f5a) test(deps): update dependency org.mockito:mockito-core to v4.8.0 ([#1609](https://github-redirect.dependabot.com/googleapis/java-storage/issues/1609))
- [34f2aa8](https://github.com/googleapis/java-storage/commit/34f2aa85e975293c7358be9b955b3bea257e9815)
[GitHub] [tika] dependabot[bot] opened a new pull request, #700: Bump spring-context from 5.3.22 to 5.3.23
dependabot[bot] opened a new pull request, #700:
URL: https://github.com/apache/tika/pull/700

Bumps [spring-context](https://github.com/spring-projects/spring-framework) from 5.3.22 to 5.3.23.

Release notes

Sourced from [spring-context's releases](https://github.com/spring-projects/spring-framework/releases).

v5.3.23

:star: New Features

- Introduce AnnotationUtils.isSynthesizedAnnotation(Annotation) [#29054](https://github-redirect.dependabot.com/spring-projects/spring-framework/issues/29054)
- Introduce createContext() factory method in AbstractGenericWebContextLoader [#28983](https://github-redirect.dependabot.com/spring-projects/spring-framework/issues/28983)
- Support TreeSet collection type in CollectionFactory.createCollection() without using reflection [#28949](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28949)
- Document when RequestEntity.getUrl() throws an UnsupportedOperationException [#28930](https://github-redirect.dependabot.com/spring-projects/spring-framework/issues/28930)
- Deprecate NestedIOException [#28929](https://github-redirect.dependabot.com/spring-projects/spring-framework/issues/28929)
- Make isConnected() in WebSocketConnectionManager public [#28785](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28785)
- Expose headers from STOMP RECEIPT frame to registered callbacks [#28715](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28715)
- Make WebClientException serializable [#28321](https://github-redirect.dependabot.com/spring-projects/spring-framework/issues/28321)

:lady_beetle: Bug Fixes

- Ordering inconsistency with beans defined in parent context [#29105](https://github-redirect.dependabot.com/spring-projects/spring-framework/issues/29105)
- RelativeRedirectResponseWrapper does not commit response in sendRedirect [#29050](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/29050)
- MockServerContainerContextCustomizerFactory does not support @Nested tests [#29037](https://github-redirect.dependabot.com/spring-projects/spring-framework/issues/29037)
- Request to improve KotlinSerializationJsonHttpMessageConverter logic in RestTemplate [#29008](https://github-redirect.dependabot.com/spring-projects/spring-framework/issues/29008)
- WebFlux: multipart requests hang sometimes [#28963](https://github-redirect.dependabot.com/spring-projects/spring-framework/issues/28963)
- DataBufferUtils.write(Publisher, Path) loses context [#28933](https://github-redirect.dependabot.com/spring-projects/spring-framework/issues/28933)
- connectionTimeOut and readTimeout not working on UrlResource [#28909](https://github-redirect.dependabot.com/spring-projects/spring-framework/issues/28909)
- SockJsServiceRegistration#setSupressCors has a typo and should be deprecated [#28853](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28853)
- RenderingResponse does not set status code on redirect views [#28839](https://github-redirect.dependabot.com/spring-projects/spring-framework/issues/28839)
- Avoid IllegalArgumentException when setting WebSocket error status [#28836](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28836)
- Loss of context path after using ServerRequest.from [#28820](https://github-redirect.dependabot.com/spring-projects/spring-framework/issues/28820)
- ResponseCookie does not declare nullability annotations consistently for domain and path [#28780](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28780)

:notebook_with_decorative_cover: Documentation

- Fix typo in data-access section [#29048](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/29048)
- Correct description of @RequestParam with WebFlux [#28944](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28944)
- Fix broken kdoc-api links in kotlin.adoc [#28908](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28908)
- Fix typos in Javadoc of class AbstractEncoder [#28885](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28885)
- Fix links in Javadoc and reference docs [#28876](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28876)
- Add missing closing parenthesis in reference doc [#28867](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28867)
- Fix typos in Javadoc, reference docs, and code [#28822](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28822)
- Replace use of the tt HTML tag in Javadoc [#28819](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28819)
- Fix broken link in rsocket documentation [#28817](https://github-redirect.dependabot.com/spring-projects/spring-framework/pull/28817)
- Clarify docs on JNDI properties in Servlet environment
[GitHub] [tika] dependabot[bot] opened a new pull request, #699: Bump aws.version from 1.12.303 to 1.12.304
dependabot[bot] opened a new pull request, #699:
URL: https://github.com/apache/tika/pull/699

Bumps `aws.version` from 1.12.303 to 1.12.304.

Updates `aws-java-sdk-transcribe` from 1.12.303 to 1.12.304

Changelog

Sourced from [aws-java-sdk-transcribe's changelog](https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).

1.12.304 2022-09-15

Amazon DynamoDB
- Features
  - Increased DynamoDB transaction limit from 25 to 100.

Amazon Elastic Compute Cloud
- Features
  - This feature allows customers to create tags for vpc-endpoint-connections and vpc-endpoint-service-permissions.

Amazon SageMaker Service
- Features
  - Amazon SageMaker Automatic Model Tuning now supports specifying Hyperband strategy for tuning jobs, which uses a multi-fidelity based tuning strategy to stop underperforming hyperparameter configurations early.

Commits

- [6550dbc](https://github.com/aws/aws-sdk-java/commit/6550dbc6d5b2c12118eecd88ac325857251a0909) AWS SDK for Java 1.12.304
- [ee307b3](https://github.com/aws/aws-sdk-java/commit/ee307b365c9c979c81cfd5f32990de045599f064) Update GitHub version number to 1.12.304-SNAPSHOT
- See full diff in [compare view](https://github.com/aws/aws-sdk-java/compare/1.12.303...1.12.304)

Updates `aws-java-sdk-s3` from 1.12.303 to 1.12.304

Changelog

Sourced from [aws-java-sdk-s3's changelog](https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).

1.12.304 2022-09-15

Amazon DynamoDB
- Features
  - Increased DynamoDB transaction limit from 25 to 100.

Amazon Elastic Compute Cloud
- Features
  - This feature allows customers to create tags for vpc-endpoint-connections and vpc-endpoint-service-permissions.

Amazon SageMaker Service
- Features
  - Amazon SageMaker Automatic Model Tuning now supports specifying Hyperband strategy for tuning jobs, which uses a multi-fidelity based tuning strategy to stop underperforming hyperparameter configurations early.

Commits

- [6550dbc](https://github.com/aws/aws-sdk-java/commit/6550dbc6d5b2c12118eecd88ac325857251a0909) AWS SDK for Java 1.12.304
- [ee307b3](https://github.com/aws/aws-sdk-java/commit/ee307b365c9c979c81cfd5f32990de045599f064) Update GitHub version number to 1.12.304-SNAPSHOT
- See full diff in [compare view](https://github.com/aws/aws-sdk-java/compare/1.12.303...1.12.304)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [tika] dependabot[bot] opened a new pull request, #698: Bump jetty.version from 9.4.48.v20220622 to 9.4.49.v20220914
dependabot[bot] opened a new pull request, #698:
URL: https://github.com/apache/tika/pull/698

Bumps `jetty.version` from 9.4.48.v20220622 to 9.4.49.v20220914.

Updates `jetty-http` from 9.4.48.v20220622 to 9.4.49.v20220914

Release notes

Sourced from [jetty-http's releases](https://github.com/eclipse/jetty.project/releases).

9.4.49.v20220914

Changelog

- [#8578](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8578) - getRequestURL can append null if getRequestURI is unspecified in an authority-form request-target
- [#8493](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8493) - Review HTTP client feature setRemoveIdleDestinations

Dependencies

- [#8253](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8253) - Bump google-cloud-datastore to 2.9.1
- [#8233](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8233) - Bump jna to 5.12.1
- [#8242](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8242) - Bump mariadb-java-client to 3.0.6
- [#8238](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8238) - Bump maven-enforcer-plugin to 3.1.0
- [#8230](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8230) - Bump maven.version to 3.8.6
- [#8246](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8246) - Bump org.eclipse.osgi to 3.18.0
- [#8245](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8245) - Bump testcontainers.version to 1.17.3

Commits

- [4231a3b](https://github.com/eclipse/jetty.project/commit/4231a3b2e4cb8548a412a789936d640a97b1aa0a) Updating to version 9.4.49.v20220914
- [b32d739](https://github.com/eclipse/jetty.project/commit/b32d739a1d158c270b98c300e9b84af245bfde2d) Merge pull request [#8579](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8579) from eclipse/fix/jetty-9.4.x-abstractproxy-null-requ...
- [5944ff4](https://github.com/eclipse/jetty.project/commit/5944ff4b3a0aa0b9c2a5ad4048fd497e6d7a23cf) Issue [#8578](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8578) - Changes from review
- [48c16de](https://github.com/eclipse/jetty.project/commit/48c16deb21efd67d369675a9126e68459fdc9408) Issue [#8578](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8578) - test both request URL/URI results
- [d3c7ee3](https://github.com/eclipse/jetty.project/commit/d3c7ee3d71c57a32336481df7246c49ff51282b1) Issue [#8578](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8578) - restore backward compat of getRequestURL and getRequestURI when...
- [06f2fa4](https://github.com/eclipse/jetty.project/commit/06f2fa41ddd83236a8484572e93fb3363c2084ad) Jetty 9.4.x : fix client remove idle destinations ([#8495](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8495))
- [940455b](https://github.com/eclipse/jetty.project/commit/940455b01274d957075166d53e9b908b27ed7ad6) [#8414](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8414): fix drainTo when head == tail but the queue isn't empty
- [a846f4f](https://github.com/eclipse/jetty.project/commit/a846f4fc9dc734d40084f58af44ac925c0ba0aa8) Updating for published CVES ([#8273](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8273))
- [064682b](https://github.com/eclipse/jetty.project/commit/064682b4ce57282e49a80a64b6d7a7a66fb47b28) Merge pull request [#8253](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8253) from eclipse/dependabot/maven/jetty-9.4.x/com.google...
- [7b40571](https://github.com/eclipse/jetty.project/commit/7b4057142ed44a29849f24a2572d2649e9458921) Merge pull request [#8245](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8245) from eclipse/dependabot/maven/jetty-9.4.x/testcontai...
- Additional commits viewable in [compare view](https://github.com/eclipse/jetty.project/compare/jetty-9.4.48.v20220622...jetty-9.4.49.v20220914)

Updates `jetty-io` from 9.4.48.v20220622 to 9.4.49.v20220914

Release notes

Sourced from [jetty-io's releases](https://github.com/eclipse/jetty.project/releases).

9.4.49.v20220914

Changelog

- [#8578](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8578) - getRequestURL can append null if getRequestURI is unspecified in an authority-form request-target
- [#8493](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8493) - Review HTTP client feature setRemoveIdleDestinations

Dependencies

- [#8253](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8253) - Bump google-cloud-datastore to 2.9.1
- [#8233](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8233) - Bump jna to 5.12.1
- [#8242](https://github-redirect.dependabot.com/eclipse/jetty.project/issues/8242) - Bump mariadb-java-client to 3.0.6
[jira] [Closed] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed TIKA-3858.
-
    Resolution: Duplicate

> Ligatures convert on text extraction
> -
>
> Key: TIKA-3858
> URL: https://issues.apache.org/jira/browse/TIKA-3858
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.1
> Environment: win 8, jre 1.5
> Reporter: tom hill
> Priority: Major
> Labels: ActualText
> Attachments: TikaChromeInboxLigature.pdf
>
>
> It appears that the issue in TIKA-1289 is still present. Ligatures get replaced by a question mark.
> As a particular example, the ft ligature is getting replaced by utf-8: ef bf bd
> Is there any new resolution on this issue? Just returning the fl ligature would be great, or normalizing it to f, t.
> This particular example comes from saving my gmail inbox page as a pdf, in chrome. It uses the ft ligature in the word "Drafts".
> There are many similar examples, it's not specific to one pdf generator.
> I'm using tika-app-2.4.1.jar

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605596#comment-17605596 ]

Tilman Hausherr commented on TIKA-3858:
---

No, except OCR. There will always be files with incomplete extraction. I don't understand why Chrome is producing these weird (but legit) files, the /ToUnicode syntax supports ligatures. ActualText support is not being worked on at this time. I have added your name in the watchers list. I'll close this issue because it isn't the fault of tika.
[jira] [Commented] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605580#comment-17605580 ]

tom hill commented on TIKA-3858:

Ok, thanks. Is there anything I can do as a Tika user to work around this issue? Is ActualText support being considered?
[jira] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858 ]

Tilman Hausherr deleted comment on TIKA-3858:
---

was (Author: tilman):
Please attach the problematic file, and compare to what you get with Adobe Reader.
[jira] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858 ]

Tilman Hausherr deleted comment on TIKA-3858:
---

was (Author: JIRAUSER295805):
Apologies, I was still editing the cloned issue. You are responding to the old text. I will update.
[jira] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858 ]

Tilman Hausherr deleted comment on TIKA-3858:
---

was (Author: JIRAUSER295805):
Ok, the description has been updated.
[jira] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858 ]

Tilman Hausherr deleted comment on TIKA-3858:
---

was (Author: tilman):
The current PDFBox version (2.0.26) doesn't use it. It's used in PDFBox 1.8.17 which has many drawbacks. The latest tika version is 2.4.1, please try that one.
[jira] [Updated] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated TIKA-3858:
--
    Labels: ActualText  (was: )
[jira] [Commented] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605576#comment-17605576 ]

Tilman Hausherr commented on TIKA-3858:
---

The font has an incorrect /ToUnicode stream which can be found at {{Root/Pages/Kids/[0]/Resources/Font/F7/ToUnicode}} with PDFDebugger. The incorrect line is {{<8D> <>}} i.e. it maps the 8D code to 0. However the page content stream corrects this with the {{ActualText}} feature that we don't support
{code}
/P << /MCID 8 >> BDC
/F7 14 Tf
1 0 0 -1 64 293 Tm
(\015) Tj
9.491989 0 Td
(j) Tj
5.026001 0 Td
(>) Tj
/Span << /ActualText (ft) >> BDC
7.6019897 0 Td
(\215) Tj
EMC
9.673996 0 Td
(k) Tj
EMC
{code}
More on this in PDFBOX-4532 and PDFBOX-5155.
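To make the comment above concrete, here is a minimal Python sketch of how a text extractor could honor /ActualText on a marked-content span. It is not PDFBox or Tika code. The content stream is the one quoted in the comment; the TO_UNICODE table, the function names, and the simplified line-based parsing are assumptions made purely for illustration. The mapping of the codes to D, r, a, s is inferred from the word "Drafts" rather than read from the actual font, and the broken 8D entry is modelled as U+FFFD to match the bytes reported in the issue.

```python
import re

# Content stream quoted in the comment, one operator per line.
CONTENT_STREAM = r"""
/P << /MCID 8 >> BDC
/F7 14 Tf
1 0 0 -1 64 293 Tm
(\015) Tj
9.491989 0 Td
(j) Tj
5.026001 0 Td
(>) Tj
/Span << /ActualText (ft) >> BDC
7.6019897 0 Td
(\215) Tj
EMC
9.673996 0 Td
(k) Tj
EMC
"""

# ASSUMPTION: glyph code -> text table, inferred from the word "Drafts".
# Octal 215 is hex 8D, the code whose /ToUnicode entry is reported as broken.
TO_UNICODE = {
    0o015: "D",
    ord("j"): "r",
    ord(">"): "a",
    0o215: "\ufffd",
    ord("k"): "s",
}


def glyph_codes(literal):
    """Turn the body of a PDF literal string into a list of byte codes."""
    codes = []
    for octal, char in re.findall(r"\\([0-7]{1,3})|(.)", literal):
        codes.append(int(octal, 8) if octal else ord(char))
    return codes


def extract_text(stream, honor_actual_text):
    """Very simplified extraction: handles only BDC, EMC and Tj operators."""
    out = []
    open_spans = []  # one entry per open BDC: its ActualText, or None
    for line in stream.strip().splitlines():
        line = line.strip()
        if line.endswith("BDC"):
            match = re.search(r"/ActualText \((.*?)\)", line)
            actual = match.group(1) if (match and honor_actual_text) else None
            if actual is not None:
                out.append(actual)  # the replacement text stands in for the span
            open_spans.append(actual)
        elif line == "EMC":
            open_spans.pop()
        elif line.endswith("Tj"):
            if any(span is not None for span in open_spans):
                continue  # glyphs inside an ActualText span are skipped
            literal = re.match(r"\((.*)\) Tj$", line).group(1)
            out.extend(TO_UNICODE[code] for code in glyph_codes(literal))
    return "".join(out)


print(ascii(extract_text(CONTENT_STREAM, honor_actual_text=False)))  # 'Dra\ufffds'
print(ascii(extract_text(CONTENT_STREAM, honor_actual_text=True)))   # 'Drafts'
```

Without ActualText the sketch reproduces the "Dra?s" symptom from the issue; with it, the span's replacement text yields "Drafts". A real implementation would have to parse the content stream properly (nested dictionaries, TJ arrays, string encodings), which this sketch deliberately ignores.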
[jira] [Commented] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605518#comment-17605518 ]

tom hill commented on TIKA-3858:

When I open TikaChromeInboxLigature.pdf in Adobe reader, the word "Drafts" uses the ft ligature. I can tell by selecting one character at a time. When I copy the word Drafts and paste it into TextEdit, I get "f" and "t" as separate characters.
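For the "normalize it to separate letters" idea raised in the issue, a minimal Python sketch follows. It assumes the extractor returns a real ligature code point such as U+FB02 (fl); the function name expand_ligatures is hypothetical. Unicode compatibility normalization (NFKC) then splits the ligature into plain letters. As far as I know Unicode has no dedicated "ft" ligature code point, and this step cannot recover text once it has already been replaced by U+FFFD, so it would not fix the attached PDF by itself.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Apply NFKC, which splits compatibility ligatures into plain letters."""
    return unicodedata.normalize("NFKC", text)


# U+FB02 (LATIN SMALL LIGATURE FL) decomposes to "f" + "l".
print(ascii(expand_ligatures("\ufb02ow")))    # 'flow'

# U+FFFD carries no information about the lost glyph, so it stays as it is.
print(ascii(expand_ligatures("Dra\ufffds")))  # 'Dra\ufffds'
```

NFKC also folds other compatibility characters (full-width forms, some superscripts), so it is best applied only where that wider effect is acceptable.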
[jira] [Commented] (TIKA-3856) Upgrade to jempbox 1.8.17
[ https://issues.apache.org/jira/browse/TIKA-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605517#comment-17605517 ] Hudson commented on TIKA-3856: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #800 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/800/]) TIKA-3856 -- upgrade jempbox to 1.8.17 (tallison: [https://github.com/apache/tika/commit/1b593b12146867ba8827ee55e7e64b01ccb4533c]) * (edit) CHANGES.txt * (edit) tika-parent/pom.xml > Upgrade to jempbox 1.8.17 > - > > Key: TIKA-3856 > URL: https://issues.apache.org/jira/browse/TIKA-3856 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Fix For: 2.5.0 > > > Vote passed. In release process now. Many thanks to [~lehmi] [~tilman] and > our PDFBox colleagues! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605512#comment-17605512 ] tom hill edited comment on TIKA-3858 at 9/15/22 8:14 PM: For the attachment TikaChromeInboxLigature.pdf % java -jar tika-app-2.4.1.jar TikaChromeInboxLigature.pdf | grep Dra | hexdump -C 3c 70 3e 44 72 61 ef bf bd 73 0a |<p>Dra...s.| 000b I believe that is 0xFFFD for the replacement character. was (Author: JIRAUSER295805): For the attachment TikaChromeInboxLigature.pdf % java -jar tika-app-2.4.1.jar TikaChromeInboxLigature.pdf | grep Dra | hexdump -C 3c 70 3e 44 72 61 ef bf bd 73 0a |<p>Dra...s.| 000b
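The guess that `ef bf bd` is the replacement character can be checked with a few lines of plain Java. The sketch below uses only the JDK; the class name `ReplacementCharCheck` is hypothetical. It encodes U+FFFD as UTF-8 and also decodes the eleven bytes from the hexdump above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
public class ReplacementCharCheck {

    // Formats the UTF-8 encoding of a string as space-separated lowercase hex bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FFFD REPLACEMENT CHARACTER encoded as UTF-8
        System.out.println(utf8Hex("\uFFFD")); // prints ef bf bd

        // The bytes reported by hexdump, decoded back to text
        byte[] dump = {0x3c, 0x70, 0x3e, 0x44, 0x72, 0x61,
                (byte) 0xef, (byte) 0xbf, (byte) 0xbd, 0x73, 0x0a};
        String decoded = new String(dump, StandardCharsets.UTF_8);
        System.out.println(decoded.charAt(6) == '\uFFFD'); // prints true
    }
}
```

So the extracted text is "Dra", then U+FFFD, then "s": the glyph for the ligature was read but could not be mapped to any character.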
[jira] [Commented] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605512#comment-17605512 ] tom hill commented on TIKA-3858: For the attachment TikaChromeInboxLigature.pdf % java -jar tika-app-2.4.1.jar TikaChromeInboxLigature.pdf | grep Dra | hexdump -C 3c 70 3e 44 72 61 ef bf bd 73 0a |<p>Dra...s.| 000b
[jira] [Updated] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tom hill updated TIKA-3858: Attachment: TikaChromeInboxLigature.pdf
[jira] [Updated] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3858: -- Affects Version/s: 2.4.1 (was: 1.5) > Ligatures convert on text extraction > - > > Key: TIKA-3858 > URL: https://issues.apache.org/jira/browse/TIKA-3858 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.4.1 > Environment: win 8, jre 1.5 >Reporter: tom hill >Priority: Major > > It appears that the issue in TIKA-1289 is still present. Ligatures get > replaced by a question mark. > As a particular example, the ft ligature is getting replaced by utf-8: ef bf > bd > Is there any new resolution on this issue? Just returning the fl ligature > would be great, or normalizing it to f, t. > This particular example comes from saving my gmail inbox page as a pdf, in > chrome. It uses the ft ligature in the word "Drafts". > There are many similar examples, it's not specific to one pdf generator. > I'm using tika-app-2.4.1.jar -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605486#comment-17605486 ] Tilman Hausherr commented on TIKA-3858: --- Please attach the problematic file, and compare to what you get with Adobe Reader. > Ligatures convert on text extraction > - > > Key: TIKA-3858 > URL: https://issues.apache.org/jira/browse/TIKA-3858 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: win 8, jre 1.5 >Reporter: tom hill >Priority: Major > > It appears that the issue in TIKA-1289 is still present. Ligatures get > replaced by a question mark. > As a particular example, the ft ligature is getting replaced by utf-8: ef bf > bd > Is there any new resolution on this issue? Just returning the fl ligature > would be great, or normalizing it to f, t. > This particular example comes from saving my gmail inbox page as a pdf, in > chrome. It uses the ft ligature in the word "Drafts". > There are many similar examples, it's not specific to one pdf generator. > I'm using tika-app-2.4.1.jar -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605485#comment-17605485 ] tom hill commented on TIKA-3858: Ok, the description has been updated. > Ligatures convert on text extraction > - > > Key: TIKA-3858 > URL: https://issues.apache.org/jira/browse/TIKA-3858 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: win 8, jre 1.5 >Reporter: tom hill >Priority: Major > > It appears that the issue in TIKA-1289 is still present. Ligatures get > replaced by a question mark. > As a particular example, the ft ligature is getting replaced by utf-8: ef bf > bd > Is there any new resolution on this issue? Just returning the fl ligature > would be great, or normalizing it to f, t. > This particular example comes from saving my gmail inbox page as a pdf, in > chrome. It uses the ft ligature in the word "Drafts". > There are many similar examples, it's not specific to one pdf generator. > I'm using tika-app-2.4.1.jar -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tom hill updated TIKA-3858: --- Description: It appears that the issue in TIKA-1289 is still present. Ligatures get replaced by a question mark. As a particular example, the ft ligature is getting replaced by utf-8: ef bf bd Is there any new resolution on this issue? Just returning the fl ligature would be great, or normalizing it to f, t. This particular example comes from saving my gmail inbox page as a pdf, in chrome. It uses the ft ligature in the word "Drafts". There are many similar examples, it's not specific to one pdf generator. I'm using tika-app-2.4.1.jar was: It appears that the issue in TIKA-1289 is still present. Ligatures get replaced by a question mark. As a particular example, the ft ligature is getting replaced by utf-8: ef bf bd Is there any new resolution on this issue? Just returning the fl ligature would be great, or normalizing it to f, t. This particular example comes from saving my gmail inbox page as a pdf, in chrome. It uses the ft ligature in the word "Drafts". There are many similar examples, it's not specific to one pdf generator. > Ligatures convert on text extraction > - > > Key: TIKA-3858 > URL: https://issues.apache.org/jira/browse/TIKA-3858 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: win 8, jre 1.5 >Reporter: tom hill >Priority: Major > > It appears that the issue in TIKA-1289 is still present. Ligatures get > replaced by a question mark. > As a particular example, the ft ligature is getting replaced by utf-8: ef bf > bd > Is there any new resolution on this issue? Just returning the fl ligature > would be great, or normalizing it to f, t. > This particular example comes from saving my gmail inbox page as a pdf, in > chrome. It uses the ft ligature in the word "Drafts". > There are many similar examples, it's not specific to one pdf generator. 
> I'm using tika-app-2.4.1.jar -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tom hill updated TIKA-3858: --- Description: It appears that the issue in TIKA-1289 is still present. Ligatures get replaced by a question mark. As a particular example, the ft ligature is getting replaced by utf-8: ef bf bd Is there any new resolution on this issue? Just returning the fl ligature would be great, or normalizing it to f, t. This particular example comes from saving my gmail inbox page as a pdf, in chrome. It uses the ft ligature in the word "Drafts". There are many similar examples, it's not specific to one pdf generator. was: It appears that the issue in TIKA-1289 is still present. Ligatures get replaced by a question mark. As a particular example, the ft ligature is getting replaced by utf-8: ef bf bd Is there any new resolution on this issue? Just returning the fl ligature would be great, or normalizing it to f, t. > Ligatures convert on text extraction > - > > Key: TIKA-3858 > URL: https://issues.apache.org/jira/browse/TIKA-3858 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: win 8, jre 1.5 >Reporter: tom hill >Priority: Major > > It appears that the issue in TIKA-1289 is still present. Ligatures get > replaced by a question mark. > As a particular example, the ft ligature is getting replaced by utf-8: ef bf > bd > Is there any new resolution on this issue? Just returning the fl ligature > would be great, or normalizing it to f, t. > This particular example comes from saving my gmail inbox page as a pdf, in > chrome. It uses the ft ligature in the word "Drafts". > There are many similar examples, it's not specific to one pdf generator. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tom hill updated TIKA-3858: --- Description: It appears that the issue in TIKA-1289 is still present. Ligatures get replaced by a question mark. As a particular example, the ft ligature is getting replaced by utf-8: ef bf bd Is there any new resolution on this issue? Just returning the fl ligature would be great, or normalizing it to f, t. was:It appears that the issue in TIKA-1289 is still present. Ligatures get replaced by a question mark. > Ligatures convert on text extraction > - > > Key: TIKA-3858 > URL: https://issues.apache.org/jira/browse/TIKA-3858 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: win 8, jre 1.5 >Reporter: tom hill >Priority: Major > > It appears that the issue in TIKA-1289 is still present. Ligatures get > replaced by a question mark. > As a particular example, the ft ligature is getting replaced by utf-8: ef bf > bd > Is there any new resolution on this issue? Just returning the fl ligature > would be great, or normalizing it to f, t. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605482#comment-17605482 ] tom hill commented on TIKA-3858: Apologies, I was still editing the cloned issue. You are responding to the old text. I will update. > Ligatures convert on text extraction > - > > Key: TIKA-3858 > URL: https://issues.apache.org/jira/browse/TIKA-3858 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: win 8, jre 1.5 >Reporter: tom hill >Priority: Major > > It appears that the issue in TIKA-1289 is still present. Ligatures get > replaced by a question mark. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tom hill updated TIKA-3858: --- Description: It appears that the issue in TIKA-1289 is still present. Ligatures get replaced by a question mark. (was: According to tika sources review, it uses pdfbox to parse pdf files. I found that pdfbox itself uses icu4j to handle ligatures. Unfortunately, when i added icu4j jar to my classpath nothing changed, ligatures are still not converted. Sample pdf file is attached.) > Ligatures convert on text extraction > - > > Key: TIKA-3858 > URL: https://issues.apache.org/jira/browse/TIKA-3858 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: win 8, jre 1.5 >Reporter: tom hill >Priority: Major > > It appears that the issue in TIKA-1289 is still present. Ligatures get > replaced by a question mark. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605481#comment-17605481 ] Tilman Hausherr commented on TIKA-3858: --- The current PDFBox version (2.0.26) doesn't use it. It's used in PDFBox 1.8.17 which has many drawbacks. The latest tika version is 2.4.1, please try that one. > Ligatures convert on text extraction > - > > Key: TIKA-3858 > URL: https://issues.apache.org/jira/browse/TIKA-3858 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: win 8, jre 1.5 >Reporter: tom hill >Priority: Major > > According to tika sources review, it uses pdfbox to parse pdf files. > I found that pdfbox itself uses icu4j to handle ligatures. > Unfortunately, when i added icu4j jar to my classpath nothing changed, > ligatures are still not converted. Sample pdf file is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-3858: -- Fix Version/s: (was: 1.7) > Ligatures convert on text extraction > - > > Key: TIKA-3858 > URL: https://issues.apache.org/jira/browse/TIKA-3858 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: win 8, jre 1.5 >Reporter: tom hill >Priority: Major > > According to tika sources review, it uses pdfbox to parse pdf files. > I found that pdfbox itself uses icu4j to handle ligatures. > Unfortunately, when i added icu4j jar to my classpath nothing changed, > ligatures are still not converted. Sample pdf file is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3858) Ligatures convert on text extraction
tom hill created TIKA-3858: -- Summary: Ligatures convert on text extraction Key: TIKA-3858 URL: https://issues.apache.org/jira/browse/TIKA-3858 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Environment: win 8, jre 1.5 Reporter: tom hill Fix For: 1.7 According to tika sources review, it uses pdfbox to parse pdf files. I found that pdfbox itself uses icu4j to handle ligatures. Unfortunately, when i added icu4j jar to my classpath nothing changed, ligatures are still not converted. Sample pdf file is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3855) Implement upsert for OpenSearch emitter
[ https://issues.apache.org/jira/browse/TIKA-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605474#comment-17605474 ] Hudson commented on TIKA-3855: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #799 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/799/]) TIKA-3855 -- enable upsert for OpenSearchEmitter (tallison: [https://github.com/apache/tika/commit/0abebdc27dbfc9eb34abc3619e13ef816af9e331]) * (edit) tika-pipes/tika-emitters/tika-emitter-opensearch/src/main/java/org/apache/tika/pipes/emitter/opensearch/OpenSearchClient.java * (edit) tika-integration-tests/tika-pipes-opensearch-integration-tests/src/test/java/org/apache/tika/pipes/opensearch/tests/TikaPipesOpenSearchTest.java * (edit) tika-integration-tests/tika-pipes-opensearch-integration-tests/src/test/java/org/apache/tika/pipes/xsearch/tests/XSearchTestClient.java * (edit) CHANGES.txt * (edit) tika-integration-tests/tika-pipes-opensearch-integration-tests/src/test/java/org/apache/tika/pipes/xsearch/tests/TikaPipesXSearchBase.java * (edit) tika-integration-tests/tika-pipes-opensearch-integration-tests/src/test/resources/opensearch/tika-config-opensearch.xml * (edit) tika-pipes/tika-emitters/tika-emitter-opensearch/src/test/java/org/apache/tika/pipes/emitter/opensearch/OpenSearchClientTest.java * (edit) tika-pipes/tika-emitters/tika-emitter-opensearch/src/main/java/org/apache/tika/pipes/emitter/opensearch/OpenSearchEmitter.java > Implement upsert for OpenSearch emitter > --- > > Key: TIKA-3855 > URL: https://issues.apache.org/jira/browse/TIKA-3855 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Fix For: 2.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3856) Upgrade to jempbox 1.8.17
[ https://issues.apache.org/jira/browse/TIKA-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3856. --- Fix Version/s: 2.5.0 Resolution: Fixed > Upgrade to jempbox 1.8.17 > - > > Key: TIKA-3856 > URL: https://issues.apache.org/jira/browse/TIKA-3856 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Fix For: 2.5.0 > > > Vote passed. In release process now. Many thanks to [~lehmi] [~tilman] and > our PDFBox colleagues! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3857) Upgrade to POI 5.2.3
Tim Allison created TIKA-3857: - Summary: Upgrade to POI 5.2.3 Key: TIKA-3857 URL: https://issues.apache.org/jira/browse/TIKA-3857 Project: Tika Issue Type: Task Reporter: Tim Allison Ran the regression tests today, and all looks good: https://corpora.tika.apache.org/base/reports/tika-2.5.0-poi-reports.tgz Vote wraps up tomorrow. :fingers-crossed: -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3856) Upgrade to jempbox 1.8.17
Tim Allison created TIKA-3856: - Summary: Upgrade to jempbox 1.8.17 Key: TIKA-3856 URL: https://issues.apache.org/jira/browse/TIKA-3856 Project: Tika Issue Type: Task Reporter: Tim Allison Vote passed. In release process now. Many thanks to [~lehmi] [~tilman] and our PDFBox colleagues! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3855) Implement upsert for OpenSearch emitter
[ https://issues.apache.org/jira/browse/TIKA-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3855. --- Fix Version/s: 2.5.0 Resolution: Fixed > Implement upsert for OpenSearch emitter > --- > > Key: TIKA-3855 > URL: https://issues.apache.org/jira/browse/TIKA-3855 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Fix For: 2.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-3855) Implement upsert for OpenSearch emitter
Tim Allison created TIKA-3855: - Summary: Implement upsert for OpenSearch emitter Key: TIKA-3855 URL: https://issues.apache.org/jira/browse/TIKA-3855 Project: Tika Issue Type: Task Reporter: Tim Allison -- This message was sent by Atlassian Jira (v8.20.10#820010)
RE: Issue related to file mime type detection
On Thu, 15 Sep 2022, Sindhu Mahadevappa wrote:
> We have been looking for the latest Tika 2.4.1 jar file, looks like it is not available anywhere.

You can get the Tika App and Tika Server jars for 2.4.1 from https://tika.apache.org/download.html

For the core and parser jars, manually downloading is not recommended as you risk missing dependencies. Just ask Maven or Gradle and they'll pull the latest jars for you.

Nick
RE: Issue related to file mime type detection
Hi Team,

Thanks for the quick response. We have been looking for the latest Tika 2.4.1 jar file, but it does not appear to be available anywhere. Could you please share a link where we can get it? That would be very helpful.

Thanks & Regards
Sindhu Mahadevappa

> -----Original Message-----
> From: Nick Burch
> Sent: Friday, September 9, 2022 3:48 PM
> To: Sindhu Mahadevappa
> Cc: dev@tika.apache.org
> Subject: Re: Issue related to file mime type detection
>
> On Fri, 9 Sep 2022, Sindhu Mahadevappa wrote:
>> We are using tika-parsers 1.23
>
> Tika 1.23 was released in December 2019! You should really use something much more recent
>
>> for comparing uploaded file mime type from file name as well as from file content for security purpose.
>
> Apache Tika's detection is not recommended for security purposes. We try our best to give an answer. Our detection does not defend against specially crafted files which look like one type but are actually a different one.
>
>> mime type from file name as audio/mp4 and mime type from file content as video/mp4 so it is validating as file type not supported.
>
> Try with a more recent version of Apache Tika. Make sure you include the Tika Parsers jar and dependencies for container aware detection within MP4 files. If you still have an issue with Tika 2.4.1, raise a bug and upload a triggering file so we can investigate
>
> Nick
[GitHub] [tika] THausherr merged pull request #696: Bump aws.version from 1.12.301 to 1.12.303
THausherr merged PR #696: URL: https://github.com/apache/tika/pull/696 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [tika] THausherr merged pull request #697: Bump maven-shade-plugin from 3.3.0 to 3.4.0
THausherr merged PR #697: URL: https://github.com/apache/tika/pull/697
[GitHub] [tika] THausherr merged pull request #695: Bump protobuf-java from 3.21.5 to 3.21.6
THausherr merged PR #695: URL: https://github.com/apache/tika/pull/695
[GitHub] [tika] dependabot[bot] opened a new pull request, #697: Bump maven-shade-plugin from 3.3.0 to 3.4.0
dependabot[bot] opened a new pull request, #697: URL: https://github.com/apache/tika/pull/697

Bumps [maven-shade-plugin](https://github.com/apache/maven-shade-plugin) from 3.3.0 to 3.4.0.

Commits
- [885de67](https://github.com/apache/maven-shade-plugin/commit/885de678577573111568e80b45869a90e2a8fb46) [maven-release-plugin] prepare release maven-shade-plugin-3.4.0
- [dc8f067](https://github.com/apache/maven-shade-plugin/commit/dc8f0679c129238813ea797ccebe690b53380eb4) Revert [maven-release-plugin] prepare release maven-shade-plugin-3.3.1
- [dcd5cae](https://github.com/apache/maven-shade-plugin/commit/dcd5caed85dbec16d8222dd9d128d16db6ee9900) Revert [maven-release-plugin] prepare for next development iteration
- [b2d5b53](https://github.com/apache/maven-shade-plugin/commit/b2d5b53f88f05616f4b92dc14f800b48bfbc9a52) [maven-release-plugin] prepare for next development iteration
- [a09e6de](https://github.com/apache/maven-shade-plugin/commit/a09e6de960061ccf600ad0c979df99d748770a55) [maven-release-plugin] prepare release maven-shade-plugin-3.3.1
- [875114a](https://github.com/apache/maven-shade-plugin/commit/875114a0c8f56dcce5dcc354d095d356dee0767a) [MSHADE-416] Fix Jenkins URL
- [ad2f6f8](https://github.com/apache/maven-shade-plugin/commit/ad2f6f8e7855860b69b950d14ca8ec627b099d6b) [MSHADE-425] Relocate services name before add to serviceEntries
- [26b5873](https://github.com/apache/maven-shade-plugin/commit/26b587384bb664daf59c30e72693ee1ae105fd71) gha shared v3
- [3994b11](https://github.com/apache/maven-shade-plugin/commit/3994b11b02182db588aa76d928b9ecc949ef15c3) Bump xmlunit-legacy from 2.7.0 to 2.9.0
- [89d9e79](https://github.com/apache/maven-shade-plugin/commit/89d9e791275450a0d742221d798005330ea797cc) Added release drafter.
Additional commits viewable in [compare view](https://github.com/apache/maven-shade-plugin/compare/maven-shade-plugin-3.3.0...maven-shade-plugin-3.4.0)
[GitHub] [tika] dependabot[bot] opened a new pull request, #696: Bump aws.version from 1.12.301 to 1.12.303
dependabot[bot] opened a new pull request, #696:
URL: https://github.com/apache/tika/pull/696

Bumps `aws.version` from 1.12.301 to 1.12.303.

Updates `aws-java-sdk-transcribe` from 1.12.301 to 1.12.303

Changelog

Sourced from [aws-java-sdk-transcribe's changelog](https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).

1.12.303 2022-09-14
- AWS Amplify UI Builder
  - Features: Amplify Studio UIBuilder is introducing forms functionality. Forms can be configured from Data Store models, JSON, or from scratch. These forms can then be generated in your project and used like any other React components.
- Amazon Elastic Compute Cloud
  - Features: This update introduces API operations to manage and create local gateway route tables, CoIP pools, and VIF group associations.

1.12.302 2022-09-13
- AWS Transfer Family
  - Features: This release introduces the ability to have multiple server host keys for any of your Transfer Family servers that use the SFTP protocol.
- AWSKendraFrontendService
  - Features: This release enables our customer to choose the option of Sharepoint 2019 for the on-premise Sharepoint connector.
- Amazon CloudWatch Evidently
  - Features: This release adds support for the client-side evaluation - powered by AWS AppConfig feature.
- Amazon Connect Customer Profiles
  - Features: Added isUnstructured in response for Customer Profiles Integration APIs
- Amazon Elastic Compute Cloud
  - Features: Two new features for local gateway route tables: support for static routes targeting Elastic Network Interfaces and direct VPC routing.
- Elastic Disaster Recovery Service
  - Features: Fixed the data type of lagDuration that is returned in Describe Source Server API

Commits
- [69485f4](https://github.com/aws/aws-sdk-java/commit/69485f42087cf9dc8cc39ec64c83c6274a40ed0c) AWS SDK for Java 1.12.303
- [16bc711](https://github.com/aws/aws-sdk-java/commit/16bc711b176e85482e324a8130ec8fc2e86be87d) Update GitHub version number to 1.12.303-SNAPSHOT
- [cdb8ca8](https://github.com/aws/aws-sdk-java/commit/cdb8ca809845bb5e32c00f4f27c67175cfc64809) AWS SDK for Java 1.12.302
- [6534fd1](https://github.com/aws/aws-sdk-java/commit/6534fd1bf1131bdda73a83ca3feb11669878cde3) Update GitHub version number to 1.12.302-SNAPSHOT

See full diff in [compare view](https://github.com/aws/aws-sdk-java/compare/1.12.301...1.12.303)

Updates `aws-java-sdk-s3` from 1.12.301 to 1.12.303

Changelog

Sourced from [aws-java-sdk-s3's changelog](https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).

1.12.303 2022-09-14
- AWS Amplify UI Builder
  - Features: Amplify Studio UIBuilder is introducing forms functionality. Forms can be configured from Data Store models, JSON, or from scratch. These forms can then be generated in your project and used like any other React components.
- Amazon Elastic Compute Cloud
  - Features: This update introduces API operations to manage and create local gateway route tables, CoIP pools, and VIF group associations.

1.12.302 2022-09-13
- AWS Transfer Family
  - Features: This release introduces the ability to have multiple server host keys for any of your Transfer Family servers that use the SFTP protocol.
- AWSKendraFrontendService
  - Features: This release enables our customer to choose the option of Sharepoint 2019 for the on-premise Sharepoint connector.
- Amazon CloudWatch Evidently
  - Features: This release adds support for the client-side evaluation - powered by AWS AppConfig feature.
- Amazon Connect Customer Profiles
  - Features: Added isUnstructured in response for Customer Profiles Integration APIs
- Amazon Elastic Compute Cloud
  - Features: Two new features for local gateway route tables: support for static routes targeting Elastic Network Interfaces and direct VPC routing.
- Elastic Disaster Recovery Service
  - Features: Fixed the data type of lagDuration that is returned in Describe Source Server API

Commits
- [69485f4](https://github.com/aws/aws-sdk-java/commit/69485f42087cf9dc8cc39ec64c83c6274a40ed0c) AWS SDK for Java 1.12.303
- [16bc711](https://github.com/aws/aws-sdk-java/commit/16bc711b176e85482e324a8130ec8fc2e86be87d) Update GitHub version number to 1.12.303-SNAPSHOT
- [cdb8ca8](https://github.com/aws/aws-sdk-java/commit/cdb8ca809845bb5e32c00f4f27c67175cfc64809) AWS SDK for Java 1.12.302
- [6534fd1](https://github.com/aws/aws-sdk-java/commit/6534fd1bf1131bdda73a83ca3feb11669878cde3) Update GitHub version number to 1.12.302-SNAPSHOT

See full diff in
[GitHub] [tika] dependabot[bot] opened a new pull request, #695: Bump protobuf-java from 3.21.5 to 3.21.6
dependabot[bot] opened a new pull request, #695:
URL: https://github.com/apache/tika/pull/695

Bumps [protobuf-java](https://github.com/protocolbuffers/protobuf) from 3.21.5 to 3.21.6.

Commits
- [24487dd](https://github.com/protocolbuffers/protobuf/commit/24487dd1045c7f3d64a21f38a3f0c06cc4cf2edb) Updating version.json and repo version numbers to: 21.6
- [d88266c](https://github.com/protocolbuffers/protobuf/commit/d88266c319e42650344f3c5df3a0feecc7865fb5) Merge pull request [#10545](https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/10545) from deannagarcia/21.x
- [cd0ee8f](https://github.com/protocolbuffers/protobuf/commit/cd0ee8f45d0d749a1e4deb9847e53efb62c04d7b) Apply patch
- [ea2f204](https://github.com/protocolbuffers/protobuf/commit/ea2f20498e2853a58875f247b06edcb567ccd86b) Uninstall system protobuf to prevent version conflicts ([#10522](https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/10522))
- [aafacb0](https://github.com/protocolbuffers/protobuf/commit/aafacb09c75d521b11500970827214f2247dd4aa) Remove broken use_bazel.sh ([#10511](https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/10511))
- [40847c7](https://github.com/protocolbuffers/protobuf/commit/40847c7ee5848f41c505a1ece1f27ec4a687837b) Fix Kokoro tests to work on Monterey machines ([#10473](https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/10473))
- [2fb33f4](https://github.com/protocolbuffers/protobuf/commit/2fb33f46a6cf6dc20fb76edef7e00162b5eedb44) Merge pull request [#10382](https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/10382) from protocolbuffers/21.x-202208092202
- [29f03e0](https://github.com/protocolbuffers/protobuf/commit/29f03e04d3f72b1749b1bf720183b0fb9b6b7d69) Update version.json to: 21.6-dev
- [638779f](https://github.com/protocolbuffers/protobuf/commit/638779f353731a0a04496bde20d14164684c3d93) Merge pull request [#10380](https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/10380) from protocolbuffers/21.x-202208091710

See full diff in [compare view](https://github.com/protocolbuffers/protobuf/compare/v3.21.5...v3.21.6)

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=com.google.protobuf:protobuf-java=maven=3.21.5=3.21.6)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org