Cool, thanks for the feedback. I've set the ticket to resolved.
Do we need to re-run the tests?
BTW, what about PDFBOX-5394? Is there anything left to do? Do we have to wait
for feedback from the user?
Andreas
On 13.04.22 at 08:29, Tilman Hausherr wrote:
Yeah, PDFBOX-5413 fixes that one as well. 👍
Tilman
On 12.04.2022 at 19:26, Tilman Hausherr wrote:
Only one left: 7LRS5U6CAFMN2P6JPTZVNBUW6XOFYH4M.pdf.
There is some sort of problem with an incremental save: part of the
multi-content stream is missing or has a new object number. Let's wait and see
whether it is related to PDFBOX-5413.
(The other one, HOAZTST4E26NPA7HL72WCIVMNRQ3E4M5.pdf, is an improvement; I'll
add it to my own tests.)
Tilman
On 12.04.2022 at 18:25, Tilman Hausherr wrote:
Only
commoncrawl3/7L/7LRS5U6CAFMN2P6JPTZVNBUW6XOFYH4M
commoncrawl3/HO/HOAZTST4E26NPA7HL72WCIVMNRQ3E4M5
have a different text extraction.
With the other two, the differences are in attachment file names or doc info.
Tilman
On 12.04.2022 at 08:16, Tilman Hausherr wrote:
After looking at the content differences and trying to rule out the
/Names differences, there are 4 files with content in TOP_10_MORE_IN_A that
look suspicious and IMHO need investigation:
commoncrawl3/7L/7LRS5U6CAFMN2P6JPTZVNBUW6XOFYH4M
govdocs1/365/365260.pdf
commoncrawl3/HO/HOAZTST4E26NPA7HL72WCIVMNRQ3E4M5
govdocs1/150/150282.pdf
Tilman
On 12.04.2022 at 08:09, Andreas Lehmkuehler wrote:
Thanks Tim!
Looks like there are 5 new exceptions left.
I'm going to check the first two:
commoncrawl3/ZC/ZCY5MCL7KI6QXVMXUZ2AJKXICQIT4TL4
commoncrawl3/WY/WYPJNTD5KQNODSXWK4GABURXRTTD5P4H
The others are thrown within Jempbox ...
Andreas
On 11.04.22 at 12:40, Tim Allison wrote:
https://corpora.tika.apache.org/base/reports/tika-2.4-20220410.tgz
Haven't had a chance to review. Hot off the vm.
On Sun, Apr 10, 2022 at 9:58 AM Tim Allison <talli...@apache.org> wrote:
Will try to kick it off today, or first thing Monday morning (EDT) at the latest.
On Sun, Apr 10, 2022 at 9:05 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
On 09.04.22 at 19:00, Tilman Hausherr wrote:
testFlattenPDFBOX2469Filled also fails in 2.0 (it is disabled by default).
I've fixed all new tickets. PDFBOX-5413 fixes the issue with the disabled
flatten test.
@Tim Is there any chance to re-run the tests?
Andreas
testFlattenPDFBOX2469Filled(org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest) Time elapsed: 1.083 s <<< ERROR!
java.io.IOException: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.
    at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.generateSamples(PDAcroFormFlattenTest.java:345)
    at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.flattenAndCompare(PDAcroFormFlattenTest.java:309)
    at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.testFlattenPDFBOX2469Filled(PDAcroFormFlattenTest.java:105)
Caused by: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.
    at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.generateSamples(PDAcroFormFlattenTest.java:345)
    at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.flattenAndCompare(PDAcroFormFlattenTest.java:309)
    at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.testFlattenPDFBOX2469Filled(PDAcroFormFlattenTest.java:105)
I'm not creating an issue this time in case this is also related to another
known problem.
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org