It looks like there may be some issues with Japanese. I don't know whether
this is related to your observation.

When I sort in ascending order of NUM_COMMON_TOKENS_DIFF_IN_B, quite a few
jpn->jpn language pairs show up among the "lost common tokens".
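
For anyone who wants to repeat that check outside a spreadsheet, here is a
rough sketch in Python. It assumes the report rows have already been loaded
as dictionaries and that a negative NUM_COMMON_TOKENS_DIFF_IN_B means common
tokens were lost in B. The LANG_A, LANG_B and FILE column names and the
sample rows are hypothetical placeholders, not the real report schema.

```python
# Only NUM_COMMON_TOKENS_DIFF_IN_B is named in this thread; the other
# column names below are hypothetical and must be adapted to the report.
DIFF = "NUM_COMMON_TOKENS_DIFF_IN_B"
LANG_A = "LANG_A"
LANG_B = "LANG_B"


def lost_common_tokens(rows, lang="jpn"):
    """Return (lost, same_lang).

    lost: rows with a negative diff, sorted ascending (largest loss first).
    same_lang: the subset of lost rows detected as lang in both A and B.
    """
    lost = sorted(
        (r for r in rows if int(r[DIFF]) < 0),
        key=lambda r: int(r[DIFF]),
    )
    same_lang = [r for r in lost if r[LANG_A] == lang and r[LANG_B] == lang]
    return lost, same_lang


# Made-up rows, purely to show the expected shape of the input.
sample = [
    {"FILE": "a.pdf", LANG_A: "jpn", LANG_B: "jpn", DIFF: "-120"},
    {"FILE": "b.pdf", LANG_A: "eng", LANG_B: "eng", DIFF: "-5"},
    {"FILE": "c.pdf", LANG_A: "jpn", LANG_B: "jpn", DIFF: "30"},
    {"FILE": "d.pdf", LANG_A: "jpn", LANG_B: "zho", DIFF: "-40"},
]

lost, same_lang = lost_common_tokens(sample)
print(len(same_lang), "of", len(lost), "rows with lost tokens are jpn->jpn")
```

Comparing the size of same_lang against lost gives a quick sense of whether
Japanese is over-represented among the files that lost common tokens.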

Will look a bit more.

On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <[email protected]>
wrote:

> On 28.07.2020 at 23:51, Tim Allison wrote:
> > Reports are here:
> > https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>
>
> Thank you. Besides the exceptions, there are a few cases in content
> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
> meaningful content; that is suspicious and needs further investigation.
>
> Tilman
>
>
