On 03.08.24 at 05:48, Tilman Hausherr wrote:
I thought I had posted the link, but it seems I didn't?! Here it is
https://home.snafu.de/tilman/tmp/reports_pdfbox_3.0.2_vs_3.0.3_2.tar.xz
I had a look at the new exception. The changes from [1] are responsible for the different behavior. IMHO that issue can be ignored. The file is a mess and 3.0.2 isn't able to extract anything. 3.0.3 simply produces another error message.

@Tilman
Are there any issues with the text extraction? If not, I'm going to cut the release this evening.

Andreas

[1] https://issues.apache.org/jira/browse/PDFBOX-5786

Tilman

On 01.08.2024 10:47, Tilman Hausherr wrote:
Thanks, I'll run another "B" and eval job but with the change from PDFBOX-5790 reverted, like I did for 2.0.32, so we get less noise in the content results.

Tilman

On 01.08.2024 07:56, Andreas Lehmkühler wrote:


On 31.07.24 at 11:45, Tilman Hausherr wrote:
On 31.07.2024 06:47, Andreas Lehmkühler wrote:
Bad news is there are a lot of new exceptions. Good news is, it looks like they are all the same.

I had a quick look and it seems to be related to [1]. I've tested some of the PDFs and they all contain corrupt streams. I guess the cause is different exception handling in such cases: 3.0.2 catches the exceptions thrown when reading corrupt streams, whereas 3.0.3 seems to struggle and stops the parsing process.

Not the ones at the bottom of new_exceptions_in_B_details.xlsx (e.g. 500436.pdf), although these also seem to be related to PDFBOX-5675. They do a rewind() near the end of the stream.
I've found a fix for both cases. Please rerun the tests whenever you have some cycles to do so.

Thanks in advance



Tilman





Andreas

[1] https://issues.apache.org/jira/browse/PDFBOX-5675


Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

