Only
commoncrawl3/7L/7LRS5U6CAFMN2P6JPTZVNBUW6XOFYH4M
commoncrawl3/HO/HOAZTST4E26NPA7HL72WCIVMNRQ3E4M5
have a different text extraction.
With the other two, the differences are in attachment file names or doc info.
Tilman
On 12.04.2022 at 08:16, Tilman Hausherr wrote:
After having looked at the content differences and trying to rule out
the /Names differences, there are 4 files with content in
TOP_10_MORE_IN_A that feel suspicious and IMHO need investigation.
commoncrawl3/7L/7LRS5U6CAFMN2P6JPTZVNBUW6XOFYH4M
govdocs1/365/365260.pdf
commoncrawl3/HO/HOAZTST4E26NPA7HL72WCIVMNRQ3E4M5
govdocs1/150/150282.pdf
Tilman
On 12.04.2022 at 08:09, Andreas Lehmkuehler wrote:
Thanks Tim!
Looks like there are 5 new exceptions left.
I'm going to check the first two:
commoncrawl3/ZC/ZCY5MCL7KI6QXVMXUZ2AJKXICQIT4TL4
commoncrawl3/WY/WYPJNTD5KQNODSXWK4GABURXRTTD5P4H
The others are thrown within Jempbox ....
Andreas
On 11.04.22 at 12:40, Tim Allison wrote:
https://corpora.tika.apache.org/base/reports/tika-2.4-20220410.tgz
Haven't had a chance to review. Hot off the vm.
On Sun, Apr 10, 2022 at 9:58 AM Tim Allison <talli...@apache.org> wrote:
Will try to kick off today, or first thing Monday morning (EDT) at the latest.
On Sun, Apr 10, 2022 at 9:05 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
On 09.04.22 at 19:00, Tilman Hausherr wrote:
testFlattenPDFBOX2469Filled also fails in 2.0 (it is disabled by default).
I've fixed all new tickets. PDFBOX-5413 fixes the issue with the disabled flatten test.
@Tim Is there any chance to re-run the tests?
Andreas
testFlattenPDFBOX2469Filled(org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest) Time elapsed: 1.083 s <<< ERROR!
java.io.IOException: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.
    at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.generateSamples(PDAcroFormFlattenTest.java:345)
    at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.flattenAndCompare(PDAcroFormFlattenTest.java:309)
    at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.testFlattenPDFBOX2469Filled(PDAcroFormFlattenTest.java:105)
Caused by: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise if a bad key is used during decryption.
    at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.generateSamples(PDAcroFormFlattenTest.java:345)
    at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.flattenAndCompare(PDAcroFormFlattenTest.java:309)
    at org.apache.pdfbox.pdmodel.interactive.form.PDAcroFormFlattenTest.testFlattenPDFBOX2469Filled(PDAcroFormFlattenTest.java:105)
I'm not creating an issue this time in case this is also related to another known problem.
Tilman
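For readers unfamiliar with the exception in the trace above: the JDK message says it "can arise if a bad key is used during decryption". The following is only a minimal sketch of that general situation using plain javax.crypto, not PDFBox code, and it does not claim to reproduce the cause of the test failure. The class name BadPaddingDemo, the helper names, the keys and the all-zero IV are made up for illustration.

```java
import javax.crypto.BadPaddingException;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;

public class BadPaddingDemo {

    // Hypothetical helper: AES/CBC with PKCS5 padding and a fixed all-zero IV.
    // A fixed IV is acceptable only for a demo, never for real encryption.
    static byte[] crypt(int mode, byte[] key, byte[] data) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(new byte[16]));
        return cipher.doFinal(data);
    }

    // Encrypts with one key, decrypts with another, and reports whether
    // the original plaintext was recovered.
    static boolean roundTrips(byte[] encryptKey, byte[] decryptKey, byte[] plain)
            throws GeneralSecurityException {
        byte[] encrypted = crypt(Cipher.ENCRYPT_MODE, encryptKey, plain);
        try {
            byte[] decrypted = crypt(Cipher.DECRYPT_MODE, decryptKey, encrypted);
            // With a wrong key the padding can occasionally look valid by chance;
            // the bytes are then garbage, so compare them instead of assuming success.
            return Arrays.equals(plain, decrypted);
        } catch (BadPaddingException e) {
            // The usual outcome for a wrong key: the final block does not
            // end in valid PKCS5 padding.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] goodKey = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
        byte[] badKey = "fedcba9876543210".getBytes(StandardCharsets.US_ASCII);
        byte[] plain = "flatten test sample".getBytes(StandardCharsets.US_ASCII);

        System.out.println("same key recovers plaintext:  " + roundTrips(goodKey, goodKey, plain));
        System.out.println("wrong key recovers plaintext: " + roundTrips(goodKey, badKey, plain));
    }
}
```

The sketch catches BadPaddingException rather than relying on it, because a wrong key does not always trigger the exception; it can also yield silently corrupted output.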
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org