The content diffs are impressive: the "more in A" column is almost
completely empty. There are only 3 files that might be relevant:
govdocs1/245/245359.doc _842859044.xls _842791279.doc
govdocs1/491/491561.ppt UNKNOWN-0.xls UNKNOWN-1.doc
govdocs1/752/752792.ppt UNKNOWN-0.xls UNKNOWN-1.doc
but I notice that the 2nd and 3rd columns have different names.
Tilman
On 18.08.2023 00:21, Tim Allison wrote:
Current reports are here:
https://corpora.tika.apache.org/base/reports/tika-2.8.1-rand1m-xyz.tgz
I expect a bunch of OLE2 files will have fewer attachments because we're no
longer duplicating/triplicating macros. I haven't had a chance to look,
but will look tomorrow.
On Tue, Aug 15, 2023 at 11:29 AM Tim Allison<[email protected]> wrote:
All,
I'm back from vacation. I had really hoped to run this release before I
left, but TIKA-4091 and TIKA-4048 left some surprises without quick fixes
available.
I'd like to fix small regressions left behind in TIKA-4091 (case
insensitive object names in OLE2), the new TIKA-4116 (duplicate macros in
some OLE2) and TIKA-4048 (the regression caused by setting extract all in
compressor parsers).
With those changes, I think we should increment the minor version -> 2.9.0.
Any blockers left for the next release? Any objections to the version
choice?
Best,
Tim