I got a bit closer. IMHO it happens here:
    private static void setNotNull(Property property, String value,
                                   Metadata metadata) {
        if (metadata.get(property) == null && !StringUtils.isEmpty(value)) {
            metadata.set(property, value);
        }
    }
The problem happens when "value" is not empty but consists only of spaces.
The PDF has a buggy XMP so you get no title from DublinCore but some
title from Basic. However this "some title" from Basic is just spaces
(which may or may not be a bug) and shouldn't be used. If such a
whitespace-only value is skipped, we get the old behavior back.
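
A minimal sketch of what skipping such a value could look like. This is
not the actual Tika code: a plain Map stands in for Metadata, the
property is a plain String, and the isBlank helper is hypothetical and
hand-rolled here so the example runs on its own. Note that trim() only
strips characters up to U+0020, so other Unicode whitespace would still
pass through.

```java
import java.util.HashMap;
import java.util.Map;

public class SetNotNullSketch {

    // Hypothetical helper: true if value is null, empty, or only
    // characters <= U+0020 (what String.trim() strips).
    static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }

    // Stand-in for the setNotNull helper quoted above. A Map replaces
    // Tika's Metadata and a String replaces Property (assumptions made
    // only to keep the sketch self-contained).
    static void setNotNull(String property, String value,
                           Map<String, String> metadata) {
        // Only set the property if it is still unset AND the candidate
        // value contains something other than whitespace.
        if (metadata.get(property) == null && !isBlank(value)) {
            metadata.put(property, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();

        // A whitespace-only title (as from the buggy XMP) is skipped.
        setNotNull("dc:title", "   ", metadata);
        System.out.println(metadata.containsKey("dc:title"));

        // A later, real title can still be set.
        setNotNull("dc:title", "Real title", metadata);
        System.out.println(metadata.get("dc:title"));
    }
}
```

With isEmpty instead of isBlank, the first call would store the three
spaces and the second call would then be ignored, because the property
is no longer null.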
Tilman
On 30.07.2020 at 18:05, Tilman Hausherr wrote:
Hi,
I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
There's something with the XMP metadata extraction. dc:title is empty
(or an empty line and maybe spaces) in Tika 1.25 but not in Tika 1.24.
I thought this could be related to some minor xmpbox changes, but Tika
doesn't use it. So I searched and found some changes in
PDMetadataExtractor.
I'm not yet sure if that is the cause, although I played around with
that one.
If it is, then it is related to
https://issues.apache.org/jira/browse/TIKA-3101
Tilman
On 30.07.2020 at 12:43, Tim Allison wrote:
Looks like there may be some issues with Japanese... don't know if this is
related to your observation?
It feels like when I sort by ascending order of NUM_COMMON_TOKENS_DIFF_IN_B,
there are quite a few jpn->jpn language pairs in the "lost common tokens".
Will look a bit more.
On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <[email protected]>
wrote:
On 28.07.2020 at 23:51, Tim Allison wrote:
Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
Thank you. Besides the exceptions, there are a few cases in content
extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
meaningful content. That is suspicious and needs further investigation.
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]