[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833385#comment-17833385 ]

Tilman Hausherr commented on TIKA-4231:
---

No this is not being worked on. You'll have to use OCR.

> Parsing Arabic PDF is returning bad data
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18. And using maven dependency tika-parsers-standard-package (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
> Reporter: Aamir
> Priority: Major
> Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
> Attached is a PDF with arabic text in it.
> When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish characters.
> The generated text doc is also attached which contains the parsed text.
> Most of the other Arabic PDFs parse fine, but this one is giving this output.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833344#comment-17833344 ]

Tim Allison commented on TIKA-4231:
---

If you run Poppler's pdftotext against the file or copy and paste out of Adobe Reader into a text file, do you get higher quality text?
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833329#comment-17833329 ]

Aamir commented on TIKA-4231:
---

Is this issue being worked on? Any updates please?
Re: [ANNOUNCE] Apache Tika 2.9.2 released
All good. I’m looking into a way to just automate the Helm Chart release based on a Webhook payload every time a new Docker container image is pushed to DockerHub. That would simplify things some…

On Tue, Apr 2, 2024 at 12:24 Tim Allison wrote:

> Oops:
> https://cwiki.apache.org/confluence/display/TIKA/Release+Process+for+tika-helm
>
> Help...
>
> On Tue, Apr 2, 2024 at 3:22 PM Tim Allison wrote:
> >
> > I did a global and thoughtless find/replace. Please review and merge
> > if this makes sense: https://github.com/apache/tika-helm/pull/19
> >
> > cc @lewis john mcgibbney
> >
> > On Tue, Apr 2, 2024 at 3:09 PM Tim Allison wrote:
> > >
> > > I also released our docker images for 2.9.2.0.
> > >
> > > How do we update helm?
> > >
> > > On Tue, Apr 2, 2024 at 2:31 PM Tim Allison wrote:
> > > >
> > > > The Apache Tika project is pleased to announce the release of Apache
> > > > Tika 2.9.2. The release contents have been pushed out to the main
> > > > Apache release site and to the Maven Central sync.
> > > >
> > > > Apache Tika is a toolkit for detecting and extracting metadata and
> > > > structured text content from various documents using existing parser
> > > > libraries.
> > > >
> > > > Apache Tika 2.9.2 includes numerous bug fixes and dependency upgrades.
> > > > Details can be found in the changes file:
> > > > https://www.apache.org/dist/tika/2.9.2/CHANGES-2.9.2.txt
> > > >
> > > > Apache Tika is available on the download page:
> > > > https://tika.apache.org/download.html
> > > >
> > > > Apache Tika is also available in binary form or for use using Maven 2
> > > > from the Central Repository:
> > > > https://repo1.maven.org/maven2/org/apache/tika/
> > > >
> > > > When downloading, please remember to verify the downloads using
> > > > signatures found: https://www.apache.org/dist/tika/KEYS
> > > >
> > > > For more information on Apache Tika, visit the project home page:
> > > > https://tika.apache.org/
> > > >
> > > > -- Tim Allison, on behalf of the Apache Tika community
Re: [ANNOUNCE] Apache Tika 2.9.2 released
Oops: https://cwiki.apache.org/confluence/display/TIKA/Release+Process+for+tika-helm

Help...
Re: [ANNOUNCE] Apache Tika 2.9.2 released
I did a global and thoughtless find/replace. Please review and merge if this makes sense: https://github.com/apache/tika-helm/pull/19

cc @lewis john mcgibbney
[PR] 2.9.2.0 release [tika-helm]
tballison opened a new pull request, #19: URL: https://github.com/apache/tika-helm/pull/19 This is a draft to update the Tika version to 2.9.2. I don't know what I'm doing with helm. Please review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [ANNOUNCE] Apache Tika 2.9.2 released
I also released our docker images for 2.9.2.0.

How do we update helm?
[ANNOUNCE] Apache Tika 2.9.2 released
The Apache Tika project is pleased to announce the release of Apache Tika 2.9.2. The release contents have been pushed out to the main Apache release site and to the Maven Central sync.

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Apache Tika 2.9.2 includes numerous bug fixes and dependency upgrades. Details can be found in the changes file:
https://www.apache.org/dist/tika/2.9.2/CHANGES-2.9.2.txt

Apache Tika is available on the download page:
https://tika.apache.org/download.html

Apache Tika is also available in binary form or for use using Maven 2 from the Central Repository:
https://repo1.maven.org/maven2/org/apache/tika/

When downloading, please remember to verify the downloads using signatures found: https://www.apache.org/dist/tika/KEYS

For more information on Apache Tika, visit the project home page:
https://tika.apache.org/

-- Tim Allison, on behalf of the Apache Tika community
Re: [PR] Support for adding custom tika configuration [tika-helm]
t-l-k commented on PR #15: URL: https://github.com/apache/tika-helm/pull/15#issuecomment-2032395440

@lewismc @nddipiazza I'm champing at the bit to see this merged; XML configuration is essential in Tika v2+
Re: [RESULT][VOTE] Release Apache Tika 2.9.2 Candidate #2
+1, Thank You!

Tomas

On Tue, Apr 2, 2024, 13:10 Tim Allison wrote:

> The vote has passed with 3 PMC +1s and no -1s.
>
> +1s
> Oleg Tikhonov
> Tilman Hausherr
> Tim Allison
>
> I'll release the artifacts shortly and update the website.
>
> Thank you, all!
>
> Best,
>
> Tim
[RESULT][VOTE] Release Apache Tika 2.9.2 Candidate #2
The vote has passed with 3 PMC +1s and no -1s.

+1s
Oleg Tikhonov
Tilman Hausherr
Tim Allison

I'll release the artifacts shortly and update the website.

Thank you, all!

Best,

Tim

On Tue, Apr 2, 2024 at 12:08 AM Oleg Tikhonov wrote:

> +1,
> Thanks.
>
> On Mon, 1 Apr 2024 at 23:36 Tim Allison wrote:
>
> > Any fellow devs able to vote? We need one more vote. Thank you!
> >
> > On Tue, Mar 26, 2024 at 12:22 PM Tilman Hausherr wrote:
> >
> > > +1
> > >
> > > successful build on Windows 10, oracle jdk 1.8.0_391
> > >
> > > Tilman
> > >
> > > On 26.03.2024 16:52, Tim Allison wrote:
> > > > A candidate for the Tika 2.9.2 release is available at:
> > > > https://dist.apache.org/repos/dist/dev/tika/2.9.2
> > > >
> > > > The release candidate is a zip archive of the sources in:
> > > > https://github.com/apache/tika/tree/2.9.2-rc2/
> > > >
> > > > The SHA-512 checksum of the archive is
> > > > 5ac7b981aa89d44e177dfb457d6f6b73dd54d43641da31e76b3e8bd9dbc236b9d2e6f6958d9182f36cbee6409293f3f21421f9c89837f693f5e10f997e9b063c.
> > > >
> > > > In addition, a staged maven repository is available here:
> > > > https://repository.apache.org/content/repositories/orgapachetika-1099/org/apache/tika
> > > >
> > > > Please vote on releasing this package as Apache Tika 2.9.2.
> > > > The vote is open for the next 72 hours and passes if a majority of at
> > > > least three +1 Tika PMC votes are cast.
> > > >
> > > > [ ] +1 Release this package as Apache Tika 2.9.2
> > > > [ ] -1 Do not release this package because...
> > > >
> > > > Here's my +1
> > > >
> > > > Best,
> > > >
> > > > Tim
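
For anyone checking the candidate, the SHA-512 comparison mentioned in the vote mail could be done along these lines. This is only a minimal sketch using the standard `java.security.MessageDigest` API; the class name `Sha512Check` and its methods are hypothetical and are not part of Tika or its release tooling, and the archive path and expected digest are supplied by the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Tika: compares a file's SHA-512 digest
// with an expected hex string such as the one quoted in a vote mail.
final class Sha512Check {

    // Streams the input through SHA-512 and returns the digest as lowercase hex.
    static String sha512Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-512");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Returns true when the file's digest equals the expected hex string,
    // ignoring letter case and surrounding whitespace.
    static boolean matches(Path file, String expectedHex) throws IOException, NoSuchAlgorithmException {
        try (InputStream in = Files.newInputStream(file)) {
            return sha512Hex(in).equalsIgnoreCase(expectedHex.trim());
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java Sha512Check.java <archive> <expected-sha512-hex>");
            return;
        }
        boolean ok = matches(Paths.get(args[0]), args[1]);
        System.out.println(ok ? "checksum matches" : "checksum MISMATCH");
    }
}
```

A checksum only guards against a corrupted download; the signature check against the KEYS file mentioned in the release announcement is still needed to establish who produced the archive.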
Java 22 is GA + Heads-up!
Welcome to the latest OpenJDK Quality Outreach update!

Java 22 was just released along with JavaFX 22 [1][2]. Thank you to all the projects that contributed to those releases by testing and providing feedback using their respective early-access builds. To celebrate, the Java DevRel Team hosted a live stream of more than four hours with guests such as Brian Goetz, Viktor Klang, and Alan Bateman. You can watch the launch stream replay here [3].

The JDK 23 schedule is now known [4], with rampdown starting in early June and general availability set for mid-September. So far, 2 JEPs have been targeted to JDK 23:

- JEP 455: Primitive Types in Patterns, instanceof, and switch (Preview) [5]
- JEP 466: Class-File API (2nd Preview) [6]

The focus should now shift to testing your project(s) on JDK 23. And don't forget that the Oracle setup-java GitHub action [7] supports, amongst others, the latest OpenJDK 23 Early-Access builds. So, JDK 23 EA testing is literally one pipeline away.

[1] https://mail.openjdk.org/pipermail/jdk-dev/2024-March/008827.html
[2] https://jdk.java.net/javafx22/
[3] https://www.youtube.com/live/AjjAZsnRXtE?feature=shared=278
[4] https://openjdk.org/projects/jdk/23/
[5] https://openjdk.org/jeps/455
[6] https://openjdk.org/jeps/466
[7] https://github.com/oracle-actions/setup-java

## Heads-up: JDK 20-23: Support for Unicode CLDR Version 42

The JDK update to CLDR version 42 included a change where regular spaces in date/time formats (and some other formatted values) were replaced with (narrow) non-breaking spaces. This led to issues for existing code that relied on parsing such strings. To address that, JDK 23 allows loose matching of spaces when parsing date/time strings. Loose matching is performed in the lenient parsing style for both date/time parsers in `java.time.format` and `java.text` packages. In the default strict parsing style, those spaces are considered distinct as before.
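
To make the space mismatch described above concrete, here is a minimal sketch of a version-independent workaround. It is not the JDK 23 loose-matching feature itself: it simply replaces U+202F (narrow no-break space) and U+00A0 (no-break space) with ordinary spaces before handing the text to a strict formatter. The class and method names (`SpaceNormalizingParser`, `normalizeSpaces`, `parseTime`) are hypothetical, and the explicit `h:mm a` pattern with `Locale.US` is an assumption chosen only for illustration.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper illustrating one way to cope with CLDR 42 style spaces.
final class SpaceNormalizingParser {

    // Explicit pattern (an assumption for this sketch) with an ordinary
    // space, U+0020, between the minutes and the AM/PM marker.
    private static final DateTimeFormatter STRICT_TIME =
            DateTimeFormatter.ofPattern("h:mm a", Locale.US);

    // Replaces narrow no-break spaces (U+202F) and no-break spaces (U+00A0)
    // with ordinary spaces; all other characters are left untouched.
    static String normalizeSpaces(String text) {
        return text.replace('\u202F', ' ').replace('\u00A0', ' ');
    }

    // Normalizes the input, then parses it with the strict formatter.
    static LocalTime parseTime(String text) {
        return LocalTime.parse(normalizeSpaces(text), STRICT_TIME);
    }

    public static void main(String[] args) {
        // Same time of day, written with a narrow no-break space and with a plain space.
        System.out.println(parseTime("3:30\u202FPM"));
        System.out.println(parseTime("3:30 PM"));
    }
}
```

Normalizing the input keeps the parser strict about everything except the kind of space. On JDK 23 the built-in alternative is the lenient parsing style, which for `java.time.format` can be requested with `DateTimeFormatterBuilder.parseLenient()`.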
Please read this updated heads-up [9] for details on how to configure strict/lenient parsing in the `java.time.format` (strict by default) and `java.text` (lenient by default) packages.

[9] https://inside.java/2024/03/29/quality-heads-up/

## Heads-up: macOS 14 users running on Apple silicon systems should update directly to macOS 14.4.1

An issue introduced by macOS 14.4 caused some Java processes, regardless of the Java version, to terminate unexpectedly on Apple silicon (AArch64). On March 25 Apple released macOS 14.4.1 and indicated on their support site that it addresses this issue. Oracle can confirm that after applying macOS 14.4.1 we are unable to reproduce the problem. So, Java users on macOS 14 running on Apple silicon systems should skip macOS 14.4 and update directly to macOS 14.4.1.

More details can be found on https://blogs.oracle.com/java/post/java-on-macos-14-4

## JDK 23 Early-Access Builds

The JDK 23 EA builds 16 are available [10], and are provided under the GNU General Public License v2, with the Classpath Exception. The Release Notes [11] are also available.

### Changes in recent JDK 23 builds that may be of interest:

- JDK-8324774: Add DejaVu web fonts (reported by AssertJ)
- JDK-8327385: Add JavaDoc option to exclude web fonts from generated documentation (reported by AssertJ)
- JDK-8328638: Fallback option for POST-only OCSP requests
- JDK-8320362: Load anchor certificates from Keychain keystore
- JDK-8327875: ChoiceFormat should advise throwing UnsupportedOperationException for unused methods
- JDK-8296244: Alternate implementation of user-based authorization Subject APIs that doesn’t depend on Security Manager APIs
- JDK-8327818: Implement Kerberos debug with sun.security.util.Debug
- JDK-7036144: GZIPInputStream readTrailer uses faulty available() test for end-of-stream
- JDK-8319251: Change LockingMode default from LM_LEGACY to LM_LIGHTWEIGHT
- JDK-8327651: Rename DictionaryEntry members related to protection domain
- JDK-8321408: Add Certainly roots R1 and E1
- JDK-8164094: javadoc allows to create a @link to a non-existent method
- JDK-8325496: Make TrimNativeHeapInterval a product switch
- JDK-8174269: Remove COMPAT locale data provider from JDK
- JDK-8322750: Test "api/java_awt/interactive/SystemTrayTests.html" failed because …
- JDK-8139457: Relax alignment of array elements
- JDK-8256314: JVM TI GetCurrentContendedMonitor is implemented incorrectly
- JDK-8326908: DecimalFormat::toPattern throws OutOfMemoryError when pattern is empty string
- JDK-8247972: incorrect implementation of JVM TI GetObjectMonitorUsage
- JDK-8325580: Remove "alternatives --remove" call from Java rpm installer
- JDK-8326838: JFR: Native mirror events
- JDK-8326106: Write and clear stack trace table outside of safepoint
- JDK-8323183: ClassFile API performance improvements
- JDK-8324829: Uniform use of synchronizations in NMT
- JDK-8326586: Improve Speed of System.map
- JDK-8318761: MessageFormat pattern support for CompactNumberFormat, ListFormat, and DateTimeFormatter