Re: PDFBox regression tests?

Andreas Lehmkuehler Fri, 31 Jul 2020 06:56:44 -0700

On 31.07.20 at 08:27, Tilman Hausherr wrote:

On 31.07.2020 at 08:05, Andreas Lehmkuehler wrote:
On 30.07.20 at 20:22, Tilman Hausherr wrote:
I've looked at all the files I had highlighted yesterday. All differences except two are related to the metadata problem.
The other two have a problem with spaces, i.e. glyphs not being near each other.

commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
This doesn't have to be a bug; I've seen many files where the extraction is better, so whatever change there is may have improved more things.
Thanks for the analysis. IMHO we are good to cut a new release, aren't we?
Yeah we could.
But if the bug gets solved it would be nice to have a new diff output to see if anything else gets shown more clearly.

I forgot to mention that the bug from PDFBOX-4927 is fixed. Is there anything else we have to wait for before we run the tests again, maybe some tika fix?


Andreas

Tilman


Tilman

On 30.07.2020 at 18:05, Tilman Hausherr wrote:

Hi,

I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf

There's something with the XMP metadata extraction. dc:title: is empty (or an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.

I thought this could be related to some minor xmpbox changes but tika doesn't use it. So I searched and found some changes in PDMetadataExtractor.


I'm not yet sure if that is the cause, although I played around with that one.

If it is, then it is related to

https://issues.apache.org/jira/browse/TIKA-3101

Tilman

On 30.07.2020 at 12:43, Tim Allison wrote:

Looks like there may be some issues with Japanese...don't know if this is
related to your observation?

It feels like when I sort by ascending order of
NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
in the "lost common tokens".

Will look a bit more.

On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <[email protected]>
wrote:

On 28.07.2020 at 23:51, Tim Allison wrote:

Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz


Thank you. Besides the exceptions, there are a few cases in content
extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
meaningful content; that is suspicious and needs further investigation.

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



