All,
  Apologies for my delay...

Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz

I haven't had a chance to look at the reports yet. :(

I tried to update the instructions for running the process on the VM.
Please let me know if you have any questions, or if I need to make
improvements.

Thank you.

   Best,

              Tim

On Fri, Jul 31, 2020 at 9:55 AM Andreas Lehmkuehler <[email protected]>
wrote:

> On 31.07.20 at 08:27, Tilman Hausherr wrote:
> > On 31.07.2020 at 08:05, Andreas Lehmkuehler wrote:
> >> On 30.07.20 at 20:22, Tilman Hausherr wrote:
> >>> I've looked at all the files I had highlighted yesterday. All
> >>> differences except two are related to the metadata problem.
> >>>
> >>> The other two have a problem with spaces, i.e. glyphs not being near
> >>> each other.
> >>>
> >>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
> >>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
> >>>
> >>> This doesn't have to be a bug; I've seen many files where the
> >>> extraction is better, so whatever change there is may have improved
> >>> more things.
> >> Thanks for the analysis. IMHO we are good to cut a new release,
> >> aren't we?
> >
> >
> > Yeah we could.
> >
> > But if the bug gets solved it would be nice to have a new diff output
> > to see if anything else gets shown more clearly.
> I forgot to mention that the bug from PDFBOX-4927 is fixed. Is there
> anything else we have to wait for before we run the tests again, maybe
> some Tika fix?
>
> Andreas
>
> > Tilman
> >
> >
> >
> >>
> >>
> >>>
> >>> Tilman
> >>>
> >>> On 30.07.2020 at 18:05, Tilman Hausherr wrote:
> >>>> Hi,
> >>>>
> >>>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
> >>>>
> >>>> There's something with the XMP metadata extraction. dc:title: is
> >>>> empty (or an empty line and maybe spaces) in tika 1.25 but not in
> >>>> tika 1.24.
> >>>>
> >>>> I thought this could be related to some minor xmpbox changes but tika
> >>>> doesn't use it. So I searched and found some changes in
> >>>> PDMetadataExtractor.
> >>>>
> >>>> I'm not yet sure if that is the cause, although I played around with
> >>>> that one.
> >>>>
> >>>> If it is, then it is related to
> >>>>
> >>>> https://issues.apache.org/jira/browse/TIKA-3101
> >>>>
> >>>> Tilman
> >>>>
> >>>> On 30.07.2020 at 12:43, Tim Allison wrote:
> >>>>> Looks like there may be some issues with Japanese...don't know if
> >>>>> this is related to your observation?
> >>>>>
> >>>>> It feels like when I sort by ascending order of
> >>>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language
> >>>>> pairs in the "lost common tokens".
> >>>>>
> >>>>> Will look a bit more.
> >>>>>
> >>>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> On 28.07.2020 at 23:51, Tim Allison wrote:
> >>>>>>> Reports are here:
> >>>>>>>
> >>>>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
> >>>>>>
> >>>>>> Thank you. Besides the exceptions, there are a few cases in content
> >>>>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A"
> >>>>>> has meaningful content; that is suspicious and needs further
> >>>>>> investigation.
> >>>>>>
> >>>>>> Tilman
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: [email protected]
> >>>> For additional commands, e-mail: [email protected]
> >>>>
