Looks like there may be some issues with Japanese... I don't know whether this is related to your observation.
When I sort by NUM_COMMON_TOKENS_DIFF_IN_B in ascending order, there seem to be quite a few jpn->jpn language pairs among the "lost common tokens". Will look a bit more.

On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <[email protected]> wrote:

> On 28.07.2020 at 23:51, Tim Allison wrote:
> > Reports are here:
> > https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
> >
> Thank you. Besides the exceptions, there are a few cases in content
> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
> meaningful content, that is suspicious and needs further investigation.
>
> Tilman
>
