Cool, thanks for the feedback. I've set the ticket to resolved.

Do we need to re-run the tests?

BTW, what about PDFBOX-5394? Is there anything left to do? Do we have to wait for the feedback of the user?

Andreas

On 13.04.22 at 08:29, Tilman Hausherr wrote:
Yeah, PDFBOX-5413 fixes that one as well. 👍

Tilman

On 12.04.2022 at 19:26, Tilman Hausherr wrote:
Only one left: 7LRS5U6CAFMN2P6JPTZVNBUW6XOFYH4M.pdf .

There is some sort of problem with an incremental save: a part of the multi-content stream is missing / has a new object number. Let's wait and see whether it is related to PDFBOX-5413.

(The other one, HOAZTST4E26NPA7HL72WCIVMNRQ3E4M5.pdf is an improvement, I'll add it to my own tests)

Tilman

On 12.04.2022 at 18:25, Tilman Hausherr wrote:
Only
commoncrawl3/7L/7LRS5U6CAFMN2P6JPTZVNBUW6XOFYH4M
commoncrawl3/HO/HOAZTST4E26NPA7HL72WCIVMNRQ3E4M5
have a different text extraction

With the other two it's attachment file names or doc info.

Tilman

On 12.04.2022 at 08:16, Tilman Hausherr wrote:
After having looked at the content differences and trying to rule out the /Names differences, there are 4 files with content in TOP_10_MORE_IN_A that feel suspicious and IMHO need investigation.

commoncrawl3/7L/7LRS5U6CAFMN2P6JPTZVNBUW6XOFYH4M
govdocs1/365/365260.pdf
commoncrawl3/HO/HOAZTST4E26NPA7HL72WCIVMNRQ3E4M5
govdocs1/150/150282.pdf

Tilman



On 12.04.2022 at 08:09, Andreas Lehmkuehler wrote:
Thanks Tim!

Looks like there are 5 new exceptions left.

I'm going to check the first two:

commoncrawl3/ZC/ZCY5MCL7KI6QXVMXUZ2AJKXICQIT4TL4
commoncrawl3/WY/WYPJNTD5KQNODSXWK4GABURXRTTD5P4H

The others are thrown within Jempbox ....


Andreas

On 11.04.22 at 12:40, Tim Allison wrote:
https://corpora.tika.apache.org/base/reports/tika-2.4-20220410.tgz

Haven't had a chance to review.  Hot off the vm.

On Sun, Apr 10, 2022 at 9:58 AM Tim Allison <talli...@apache.org> wrote:

Will try to kick off today…first thing Monday morning (EDT) at the latest.

On Sun, Apr 10, 2022 at 9:05 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:

On 09.04.22 at 19:00, Tilman Hausherr wrote:
testFlattenPDFBOX2469Filled also fails in 2.0 (it is disabled by default).
I've fixed all new tickets. PDFBOX-5413 fixes the issue with the disabled
flatten test.

@Tim Is there any chance to re-run the tests?

Andreas


testFlattenPDFBOX2469Filled(org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest) Time elapsed: 1.083 s  <<< ERROR!
java.io.IOException: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.
      at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.generateSamples(PDAcroFormFlattenTest.java:345)
      at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.flattenAndCompare(PDAcroFormFlattenTest.java:309)
      at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.testFlattenPDFBOX2469Filled(PDAcroFormFlattenTest.java:105)
Caused by: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.
      at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.generateSamples(PDAcroFormFlattenTest.java:345)
      at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.flattenAndCompare(PDAcroFormFlattenTest.java:309)
      at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.testFlattenPDFBOX2469Filled(PDAcroFormFlattenTest.java:105)


I'm not creating an issue this time in case this is also related to another
known problem.

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org


