I confirmed that the drop in attachments is actually a good thing in the
handful of files that I randomly sampled. One file had 9 copies of the
same macro.
The mismatch in tika-eval's attachment matching is troubling, but I think
we're good to go. I'll kick off the release process tomorrow morning (ET)
unless there are objections.
Best,
Tim
On Sun, Aug 20, 2023 at 7:21 AM Tilman Hausherr <[email protected]>
wrote:
> The content diffs are impressive, the "more in A" column is almost fully
> empty. There are only 3 files that might be relevant:
>
> govdocs1/245/245359.doc _842859044.xls _842791279.doc
> govdocs1/491/491561.ppt UNKNOWN-0.xls UNKNOWN-1.doc
> govdocs1/752/752792.ppt UNKNOWN-0.xls UNKNOWN-1.doc
>
>
> but I notice that the 2nd and 3rd columns have different names.
>
> Tilman
>
> On 18.08.2023 00:21, Tim Allison wrote:
> > Current reports are here:
> > https://corpora.tika.apache.org/base/reports/tika-2.8.1-rand1m-xyz.tgz
> >
> > I expect a bunch of ole2 files will have fewer attachments because we're no
> > longer duplicating/triplicating macros. I haven't had a chance to look,
> > but will look tomorrow.
> >
> > On Tue, Aug 15, 2023 at 11:29 AM Tim Allison <[email protected]> wrote:
> >
> >> All,
> >>
> >> I'm back from vacation. I had really hoped to run this release before I
> >> left, but TIKA-4091 and TIKA-4048 left some surprises without quick fixes
> >> available.
> >>
> >> I'd like to fix small regressions left behind in TIKA-4091 (case
> >> insensitive object names in OLE2), the new TIKA-4116 (duplicate macros in
> >> some OLE2) and TIKA-4048 (the regression caused by setting extract all in
> >> compressor parsers).
> >>
> >> With those changes, I think we should increment the minor version -> 2.9.0.
> >>
> >> Any blockers left for the next release? Any objections to the version
> >> choice?
> >>
> >>
> >> Best,
> >>
> >> Tim
> >>
> >>
> >>
>