On 03.08.24 at 05:48, Tilman Hausherr wrote:
I thought I had posted the link, but it seems I didn't. Here it is:
https://home.snafu.de/tilman/tmp/reports_pdfbox_3.0.2_vs_3.0.3_2.tar.xz
I had a look at the new exception. The changes from [1] are responsible
for the different behavior. IMHO that issue can be ignored. The file is
a mess and 3.0.2 isn't able to extract anything. 3.0.3 simply produces
another error message.
@Tilman
Are there any issues with the text extraction? Otherwise I'm going to
cut the release this evening.
Andreas
[1] https://issues.apache.org/jira/browse/PDFBOX-5786
Tilman
On 01.08.2024 10:47, Tilman Hausherr wrote:
Thanks, I'll run another "B" and eval job but with the change from
PDFBOX-5790 reverted, like I did for 2.0.32, so we get less noise in
the content results.
Tilman
On 01.08.2024 07:56, Andreas Lehmkühler wrote:
On 31.07.24 at 11:45, Tilman Hausherr wrote:
On 31.07.2024 06:47, Andreas Lehmkühler wrote:
Bad news is there are a lot of new exceptions. Good news is, it
looks like they are all the same.
I had a quick look and it seems to be related to [1]. I've tested
some of the PDFs and they all contain corrupt streams. I guess the
issue is different exception handling in such cases: 3.0.2 catches
such exceptions when reading corrupt streams, while 3.0.3 seems to
struggle and stops the parsing process.
Not the ones at the bottom of new_exceptions_in_B_details.xlsx (e.g.
500436.pdf), although these also seem to be related to
PDFBOX-5675. They do a rewind() near the end of the stream.
I've found a fix for both cases. Please rerun the tests whenever you
have some cycles to do so.
Thanks in advance
Tilman
Andreas
[1] https://issues.apache.org/jira/browse/PDFBOX-5675
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org