[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-02 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833385#comment-17833385
 ] 

Tilman Hausherr commented on TIKA-4231:
---

No, this is not being worked on. You'll have to use OCR.

> Parsing Arabic PDF is returning bad data
>
> Key: TIKA-4231
> URL: https://issues.apache.org/jira/browse/TIKA-4231
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.6.0, 2.9.1
> Environment: I am using Java 18 and the Maven dependency
> tika-parsers-standard-package
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
> Reporter: Aamir
> Priority: Major
> Attachments: arabic-pdfbox.txt, arabic.pdf, arabic.txt
>
> Attached is a PDF with Arabic text in it.
> When parsed using Tika version 2.6.0 or 2.9.1, it produces gibberish
> characters.
> The generated text file is also attached; it contains the parsed text.
> Most other Arabic PDFs parse fine, but this one gives this output.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-02 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833344#comment-17833344
 ] 

Tim Allison commented on TIKA-4231:
---

If you run Poppler's pdftotext against the file or copy and paste out of Adobe 
Reader into a text file, do you get higher quality text?
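
One rough way to compare the output of different extractors is to measure how much of the extracted text is actually Arabic. The following is only a minimal sketch, not part of Tika or Poppler: the class name `ArabicRatio` and the method `arabicLetterRatio` are hypothetical, and it assumes a low share of Arabic-script letters is a usable hint that the extraction is gibberish. A bad font mapping can also yield wrong Arabic letters, which this check would not catch.

```java
import java.lang.Character.UnicodeScript;

public class ArabicRatio {

    /**
     * Returns the fraction of letters in the text whose Unicode script is
     * Arabic, or 0.0 when the text contains no letters at all.
     * Hypothetical helper for illustration only.
     */
    static double arabicLetterRatio(String text) {
        long letters = text.codePoints()
                .filter(Character::isLetter)
                .count();
        if (letters == 0) {
            return 0.0;
        }
        long arabic = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> UnicodeScript.of(cp) == UnicodeScript.ARABIC)
                .count();
        return (double) arabic / letters;
    }

    public static void main(String[] args) {
        // "\u0645\u0631\u062D\u0628\u0627" is the Arabic word for "hello".
        String arabicOnly = "\u0645\u0631\u062D\u0628\u0627";
        String mixed = "abc " + arabicOnly;
        System.out.println(arabicLetterRatio(arabicOnly));
        System.out.println(arabicLetterRatio(mixed));
    }
}
```

Running the same check over the text produced by Tika, by pdftotext, and by copy and paste from a viewer gives three numbers that can be compared side by side.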



[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-04-02 Thread Aamir (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833329#comment-17833329
 ] 

Aamir commented on TIKA-4231:
-

Is this issue being worked on? Any updates please?



Re: [ANNOUNCE] Apache Tika 2.9.2 released

2024-04-02 Thread lewis john mcgibbney
All good.
I’m looking into a way to just automate the Helm Chart release based on a
Webhook payload every time a new Docker container image is pushed to
DockerHub.
That would simplify things some…


On Tue, Apr 2, 2024 at 12:24 Tim Allison  wrote:

> Oops:
> https://cwiki.apache.org/confluence/display/TIKA/Release+Process+for+tika-helm
>
> Help...
>
> On Tue, Apr 2, 2024 at 3:22 PM Tim Allison  wrote:
> >
> > I did a global and thoughtless find/replace. Please review and merge
> > if this makes sense: https://github.com/apache/tika-helm/pull/19
> >
> > cc @lewis john mcgibbney
> >
> > On Tue, Apr 2, 2024 at 3:09 PM Tim Allison  wrote:
> > >
> > > I also released our docker images for 2.9.2.0.
> > >
> > > How do we update helm?
> > >
> > > On Tue, Apr 2, 2024 at 2:31 PM Tim Allison wrote:
> > > >
> > > > The Apache Tika project is pleased to announce the release of Apache
> > > > Tika 2.9.2. The release contents have been pushed out to the main
> > > > Apache release site and to the Maven Central sync.
> > > >
> > > > Apache Tika is a toolkit for detecting and extracting metadata and
> > > > structured text content from various documents using existing parser
> > > > libraries.
> > > >
> > > > Apache Tika 2.9.2 includes numerous bug fixes and dependency upgrades.
> > > > Details can be found in the changes file:
> > > > https://www.apache.org/dist/tika/2.9.2/CHANGES-2.9.2.txt
> > > >
> > > > Apache Tika is available on the download page:
> > > > https://tika.apache.org/download.html
> > > >
> > > > Apache Tika is also available in binary form or for use using Maven 2
> > > > from the Central Repository:
> > > > https://repo1.maven.org/maven2/org/apache/tika/
> > > >
> > > > When downloading, please remember to verify the downloads using
> > > > signatures found: https://www.apache.org/dist/tika/KEYS
> > > >
> > > > For more information on Apache Tika, visit the project home page:
> > > > https://tika.apache.org/
> > > >
> > > > -- Tim Allison, on behalf of the Apache Tika community
>


Re: [ANNOUNCE] Apache Tika 2.9.2 released

2024-04-02 Thread Tim Allison
Oops: 
https://cwiki.apache.org/confluence/display/TIKA/Release+Process+for+tika-helm

Help...

On Tue, Apr 2, 2024 at 3:22 PM Tim Allison  wrote:
>
> I did a global and thoughtless find/replace. Please review and merge
> if this makes sense: https://github.com/apache/tika-helm/pull/19
>
> cc @lewis john mcgibbney
>
> On Tue, Apr 2, 2024 at 3:09 PM Tim Allison  wrote:
> >
> > I also released our docker images for 2.9.2.0.
> >
> > How do we update helm?
> >
> > On Tue, Apr 2, 2024 at 2:31 PM Tim Allison  wrote:
> > >
> > > The Apache Tika project is pleased to announce the release of Apache
> > > Tika 2.9.2. The release contents have been pushed out to the main
> > > Apache release site and to the Maven Central sync.
> > >
> > > Apache Tika is a toolkit for detecting and extracting metadata and
> > > structured text content from various documents using existing parser
> > > libraries.
> > >
> > > Apache Tika 2.9.2 includes numerous bug fixes and dependency upgrades.
> > > Details can be found in the changes file:
> > > https://www.apache.org/dist/tika/2.9.2/CHANGES-2.9.2.txt
> > >
> > > Apache Tika is available on the download page:
> > > https://tika.apache.org/download.html
> > >
> > > Apache Tika is also available in binary form or for use using Maven 2
> > > from the Central Repository:
> > > https://repo1.maven.org/maven2/org/apache/tika/
> > >
> > > When downloading, please remember to verify the downloads using
> > > signatures found: https://www.apache.org/dist/tika/KEYS
> > >
> > > For more information on Apache Tika, visit the project home page:
> > > https://tika.apache.org/
> > >
> > > -- Tim Allison, on behalf of the Apache Tika community


Re: [ANNOUNCE] Apache Tika 2.9.2 released

2024-04-02 Thread Tim Allison
I did a global and thoughtless find/replace. Please review and merge
if this makes sense: https://github.com/apache/tika-helm/pull/19

cc @lewis john mcgibbney

On Tue, Apr 2, 2024 at 3:09 PM Tim Allison  wrote:
>
> I also released our docker images for 2.9.2.0.
>
> How do we update helm?
>
> On Tue, Apr 2, 2024 at 2:31 PM Tim Allison  wrote:
> >
> > The Apache Tika project is pleased to announce the release of Apache
> > Tika 2.9.2. The release contents have been pushed out to the main
> > Apache release site and to the Maven Central sync.
> >
> > Apache Tika is a toolkit for detecting and extracting metadata and
> > structured text content from various documents using existing parser
> > libraries.
> >
> > Apache Tika 2.9.2 includes numerous bug fixes and dependency upgrades.
> > Details can be found in the changes file:
> > https://www.apache.org/dist/tika/2.9.2/CHANGES-2.9.2.txt
> >
> > Apache Tika is available on the download page:
> > https://tika.apache.org/download.html
> >
> > Apache Tika is also available in binary form or for use using Maven 2
> > from the Central Repository:
> > https://repo1.maven.org/maven2/org/apache/tika/
> >
> > When downloading, please remember to verify the downloads using
> > signatures found: https://www.apache.org/dist/tika/KEYS
> >
> > For more information on Apache Tika, visit the project home page:
> > https://tika.apache.org/
> >
> > -- Tim Allison, on behalf of the Apache Tika community


[PR] 2.9.2.0 release [tika-helm]

2024-04-02 Thread via GitHub


tballison opened a new pull request, #19:
URL: https://github.com/apache/tika-helm/pull/19

   This is a draft to update the Tika version to 2.9.2. I don't know what I'm
doing with Helm. Please review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [ANNOUNCE] Apache Tika 2.9.2 released

2024-04-02 Thread Tim Allison
I also released our docker images for 2.9.2.0.

How do we update helm?

On Tue, Apr 2, 2024 at 2:31 PM Tim Allison  wrote:
>
> The Apache Tika project is pleased to announce the release of Apache
> Tika 2.9.2. The release contents have been pushed out to the main
> Apache release site and to the Maven Central sync.
>
> Apache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser
> libraries.
>
> Apache Tika 2.9.2 includes numerous bug fixes and dependency upgrades.
> Details can be found in the changes file:
> https://www.apache.org/dist/tika/2.9.2/CHANGES-2.9.2.txt
>
> Apache Tika is available on the download page:
> https://tika.apache.org/download.html
>
> Apache Tika is also available in binary form or for use using Maven 2
> from the Central Repository:
> https://repo1.maven.org/maven2/org/apache/tika/
>
> When downloading, please remember to verify the downloads using
> signatures found: https://www.apache.org/dist/tika/KEYS
>
> For more information on Apache Tika, visit the project home page:
> https://tika.apache.org/
>
> -- Tim Allison, on behalf of the Apache Tika community


[ANNOUNCE] Apache Tika 2.9.2 released

2024-04-02 Thread Tim Allison
The Apache Tika project is pleased to announce the release of Apache
Tika 2.9.2. The release contents have been pushed out to the main
Apache release site and to the Maven Central sync.

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries.

Apache Tika 2.9.2 includes numerous bug fixes and dependency upgrades.
Details can be found in the changes file:
https://www.apache.org/dist/tika/2.9.2/CHANGES-2.9.2.txt

Apache Tika is available on the download page:
https://tika.apache.org/download.html

Apache Tika is also available in binary form or for use using Maven 2
from the Central Repository:
https://repo1.maven.org/maven2/org/apache/tika/

When downloading, please remember to verify the downloads using
signatures found: https://www.apache.org/dist/tika/KEYS

For more information on Apache Tika, visit the project home page:
https://tika.apache.org/

-- Tim Allison, on behalf of the Apache Tika community


Re: [PR] Support for adding custom tika configuration [tika-helm]

2024-04-02 Thread via GitHub


t-l-k commented on PR #15:
URL: https://github.com/apache/tika-helm/pull/15#issuecomment-2032395440

   @lewismc @nddipiazza I'm champing at the bit to see this merged; XML
configuration is essential in Tika v2+.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [RESULT][VOTE] Release Apache Tika 2.9.2 Candidate #2

2024-04-02 Thread Tomas Lanik
+1 , Thank You!
Tomas

On Tue, Apr 2, 2024, 13:10 Tim Allison  wrote:

> The vote has passed with 3 PMC +1s and no -1s.
>
> +1s
> Oleg Tikhonov
> Tilman Hausherr
> Tim Allison
>
> I'll release the artifacts shortly and update the website.
>
> Thank you, all!
>
> Best,
>
>  Tim
>
> On Tue, Apr 2, 2024 at 12:08 AM Oleg Tikhonov wrote:
>
> > +1,
> > Thanks.
> >
> > On Mon, 1 Apr 2024 at 23:36 Tim Allison  wrote:
> >
> > > Any fellow devs able to vote? We need one more vote. Thank you!
> > >
> > > On Tue, Mar 26, 2024 at 12:22 PM Tilman Hausherr <thaush...@t-online.de> wrote:
> > >
> > > > +1
> > > >
> > > > successful build on Windows 10, oracle jdk 1.8.0_391
> > > >
> > > > Tilman
> > > >
> > > > On 26.03.2024 16:52, Tim Allison wrote:
> > > > > A candidate for the Tika 2.9.2 release is available at:
> > > > > https://dist.apache.org/repos/dist/dev/tika/2.9.2
> > > > >
> > > > > The release candidate is a zip archive of the sources in:
> > > > > https://github.com/apache/tika/tree/2.9.2-rc2/
> > > > >
> > > > > The SHA-512 checksum of the archive is
> > > > > 5ac7b981aa89d44e177dfb457d6f6b73dd54d43641da31e76b3e8bd9dbc236b9d2e6f6958d9182f36cbee6409293f3f21421f9c89837f693f5e10f997e9b063c.
> > > > >
> > > > > In addition, a staged maven repository is available here:
> > > > > https://repository.apache.org/content/repositories/orgapachetika-1099/org/apache/tika
> > > > >
> > > > > Please vote on releasing this package as Apache Tika 2.9.2.
> > > > > The vote is open for the next 72 hours and passes if a majority of at
> > > > > least three +1 Tika PMC votes are cast.
> > > > >
> > > > > [ ] +1 Release this package as Apache Tika 2.9.2
> > > > > [ ] -1 Do not release this package because...
> > > > >
> > > > > Here's my +1
> > > > >
> > > > > Best,
> > > > >
> > > > >Tim
> > > >
> > > >
> > > >
> > >
> >
>


[RESULT][VOTE] Release Apache Tika 2.9.2 Candidate #2

2024-04-02 Thread Tim Allison
The vote has passed with 3 PMC +1s and no -1s.

+1s
Oleg Tikhonov
Tilman Hausherr
Tim Allison

I'll release the artifacts shortly and update the website.

Thank you, all!

Best,

 Tim

On Tue, Apr 2, 2024 at 12:08 AM Oleg Tikhonov wrote:

> +1,
> Thanks.
>
> On Mon, 1 Apr 2024 at 23:36 Tim Allison  wrote:
>
> > Any fellow devs able to vote? We need one more vote. Thank you!
> >
> > On Tue, Mar 26, 2024 at 12:22 PM Tilman Hausherr wrote:
> >
> > > +1
> > >
> > > successful build on Windows 10, oracle jdk 1.8.0_391
> > >
> > > Tilman
> > >
> > > On 26.03.2024 16:52, Tim Allison wrote:
> > > > A candidate for the Tika 2.9.2 release is available at:
> > > > https://dist.apache.org/repos/dist/dev/tika/2.9.2
> > > >
> > > > The release candidate is a zip archive of the sources in:
> > > > https://github.com/apache/tika/tree/2.9.2-rc2/
> > > >
> > > > The SHA-512 checksum of the archive is
> > > > 5ac7b981aa89d44e177dfb457d6f6b73dd54d43641da31e76b3e8bd9dbc236b9d2e6f6958d9182f36cbee6409293f3f21421f9c89837f693f5e10f997e9b063c.
> > > >
> > > > In addition, a staged maven repository is available here:
> > > > https://repository.apache.org/content/repositories/orgapachetika-1099/org/apache/tika
> > > >
> > > > Please vote on releasing this package as Apache Tika 2.9.2.
> > > > The vote is open for the next 72 hours and passes if a majority of at
> > > > least three +1 Tika PMC votes are cast.
> > > >
> > > > [ ] +1 Release this package as Apache Tika 2.9.2
> > > > [ ] -1 Do not release this package because...
> > > >
> > > > Here's my +1
> > > >
> > > > Best,
> > > >
> > > >Tim
> > >
> > >
> > >
> >
>
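
The vote mail above publishes a SHA-512 checksum for the release candidate archive. As a minimal sketch of how such a checksum could be checked locally with only the JDK: the class name `Sha512Check` and its methods are hypothetical, and the archive path and published checksum are taken from the command line rather than assumed. This checks the digest only and does not replace verifying the PGP signatures against the KEYS file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha512Check {

    /** Creates a SHA-512 digest, wrapping the checked exception. */
    static MessageDigest newSha512() {
        try {
            return MessageDigest.getInstance("SHA-512");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-512 is not available in this JRE", e);
        }
    }

    /** Formats digest bytes as lowercase hexadecimal. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Hashes an in-memory byte array. */
    static String sha512Hex(byte[] data) {
        return toHex(newSha512().digest(data));
    }

    /** Hashes a file by streaming it, so large archives are not loaded into memory. */
    static String sha512Hex(Path file) throws IOException {
        MessageDigest md = newSha512();
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return toHex(md.digest());
    }

    /** Compares case-insensitively, ignoring whitespace and a trailing period in the published value. */
    static boolean matches(String actualHex, String publishedHex) {
        String cleaned = publishedHex.trim().replaceAll("\\.$", "");
        return actualHex.equalsIgnoreCase(cleaned);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Sha512Check <archive> <published-sha512>");
            return;
        }
        String actual = sha512Hex(Path.of(args[0]));
        System.out.println(matches(actual, args[1]) ? "checksum OK" : "checksum MISMATCH: " + actual);
    }
}
```

The `matches` helper tolerates the trailing period that follows the checksum in the mail, so the value can be pasted as-is.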


Java 22 is GA + Heads-up!

2024-04-02 Thread David Delabassee
Welcome to the latest OpenJDK Quality Outreach update!

Java 22 was just released along with JavaFX 22 [1][2]. Thank you to all the 
projects that contributed to those releases by testing and providing feedback 
using their respective early-access builds. To celebrate, the Java DevRel Team 
hosted a live stream of over four hours with guests including Brian Goetz, 
Viktor Klang, and Alan Bateman. You can watch the launch stream replay here [3].

The JDK 23 schedule is now known [4], with rampdown starting in early June and 
general availability set for mid-September. So far, 2 JEPs have been targeted 
to JDK 23:
- JEP 455: Primitive Types in Patterns, instanceof, and switch (Preview) [5]
- JEP 466: Class-File API (2nd Preview) [6]

The focus should now shift to testing your project(s) on JDK 23. And don't 
forget that the Oracle setup-java GitHub Action [7] supports, amongst others, 
the latest OpenJDK 23 Early-Access builds. So, JDK 23 EA testing is literally 
one pipeline away.

[1] https://mail.openjdk.org/pipermail/jdk-dev/2024-March/008827.html
[2] https://jdk.java.net/javafx22/
[3] https://www.youtube.com/live/AjjAZsnRXtE?feature=shared=278
[4] https://openjdk.org/projects/jdk/23/
[5] https://openjdk.org/jeps/455
[6] https://openjdk.org/jeps/466
[7] https://github.com/oracle-actions/setup-java


## Heads-up: JDK 20-23: Support for Unicode CLDR Version 42

The JDK update to CLDR version 42 included a change where regular spaces in 
date/time formats (and some other formatted values) were replaced with (narrow) 
non-breaking spaces. This led to issues for existing code that relied on 
parsing such strings. To address that, JDK 23 allows loose matching of spaces 
when parsing date/time strings. Loose matching is performed in the lenient 
parsing style for both date/time parsers in `java.time.format` and `java.text` 
packages. In the default strict parsing style, those spaces are considered 
distinct as before.

Please read this updated heads-up [9] for details on how to configure 
strict/lenient parsing in the `java.time.format` (strict by default) and 
`java.text` (lenient by default) packages.

[9] https://inside.java/2024/03/29/quality-heads-up/
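
For code that must also run on JDK 20-22, where the loose matching described above is not available, one possible workaround is to normalize the input before parsing. This is only a sketch under that assumption and is not taken from the heads-up itself: the class `SpaceNormalizingParse` and its methods are hypothetical names, and the pattern `h:mm a` with `Locale.US` is just an example.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class SpaceNormalizingParse {

    /** Replaces narrow no-break space (U+202F) and no-break space (U+00A0) with a regular space. */
    static String normalizeSpaces(String s) {
        return s.replace('\u202F', ' ').replace('\u00A0', ' ');
    }

    /** Parses a 12-hour time such as "10:15 PM" after normalizing its spaces. */
    static LocalTime parseTime(String input) {
        // The pattern contains a literal regular space, so the input must use one too.
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern("h:mm a", Locale.US);
        return LocalTime.parse(normalizeSpaces(input), formatter);
    }

    public static void main(String[] args) {
        // Input as a CLDR 42 based formatter might produce it, with U+202F before "PM".
        System.out.println(parseTime("10:15\u202FPM"));
    }
}
```

Normalizing the input keeps the formatter strict about everything except the kind of space, which may be preferable to switching the whole parser to lenient mode.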


## Heads-up: macOS 14 users running on Apple silicon systems should update directly to macOS 14.4.1

An issue introduced by macOS 14.4 caused some Java processes, regardless of the 
Java version, to terminate unexpectedly on Apple silicon (AArch64). On March 25 
Apple released macOS 14.4.1 and indicated on their support site that it 
addresses this issue. Oracle can confirm that after applying macOS 14.4.1 we 
are unable to reproduce the problem. So, Java users on macOS 14 running on 
Apple silicon systems should skip macOS 14.4 and update directly to macOS 
14.4.1.

More details can be found at 
https://blogs.oracle.com/java/post/java-on-macos-14-4


## JDK 23 Early-Access Builds

JDK 23 EA build 16 is available [10] and is provided under the GNU 
General Public License v2, with the Classpath Exception. The Release Notes [11] 
are also available.

### Changes in recent JDK 23 builds that may be of interest:
- JDK-8324774: Add DejaVu web fonts (reported by AssertJ)
- JDK-8327385: Add JavaDoc option to exclude web fonts from generated 
documentation (reported by AssertJ)
- JDK-8328638: Fallback option for POST-only OCSP requests
- JDK-8320362: Load anchor certificates from Keychain keystore
- JDK-8327875: ChoiceFormat should advise throwing 
UnsupportedOperationException for unused methods
- JDK-8296244: Alternate implementation of user-based authorization Subject 
APIs that doesn’t depend on Security Manager APIs
- JDK-8327818: Implement Kerberos debug with sun.security.util.Debug
- JDK-7036144: GZIPInputStream readTrailer uses faulty available() test for 
end-of-stream
- JDK-8319251: Change LockingMode default from LM_LEGACY to LM_LIGHTWEIGHT
- JDK-8327651: Rename DictionaryEntry members related to protection domain
- JDK-8321408: Add Certainly roots R1 and E1
- JDK-8164094: javadoc allows to create a @link to a non-existent method
- JDK-8325496: Make TrimNativeHeapInterval a product switch
- JDK-8174269: Remove COMPAT locale data provider from JDK
- JDK-8322750: Test "api/java_awt/interactive/SystemTrayTests.html" failed 
because …
- JDK-8139457: Relax alignment of array elements
- JDK-8256314: JVM TI GetCurrentContendedMonitor is implemented incorrectly
- JDK-8326908: DecimalFormat::toPattern throws OutOfMemoryError when pattern is 
empty string
- JDK-8247972: incorrect implementation of JVM TI GetObjectMonitorUsage
- JDK-8325580: Remove "alternatives --remove" call from Java rpm installer
- JDK-8326838: JFR: Native mirror events
- JDK-8326106: Write and clear stack trace table outside of safepoint
- JDK-8323183: ClassFile API performance improvements
- JDK-8324829: Uniform use of synchronizations in NMT
- JDK-8326586: Improve Speed of System.map
- JDK-8318761: MessageFormat pattern support for CompactNumberFormat, 
ListFormat, and DateTimeFormatter