a) In the SQLs, I see the *_a/*_b tables - so _a is the result of using
POI 4.1.1 and _b of POI 4.1.2?
_a is Tika 1.23 (which used POI 4.1.1); _b is the Tika 1.x branch with
POI 4.1.2.
*WARNING*: diffs we observe may be due to changes in Tika between 1.23
and the 1.x branch.

b) Are the stats evaluated for both each time or is *_a cached from last
run?
I had to rerun 1.23 because I had wiped it out.

b) If a) is true, it's interesting that the attachment-missing* have such
similar numbers. I would expect one side to outweigh the other.
That is unexpected. Aligning attachments is tricky if one version is
missing an attachment. It is possible that this reflects a failure to
align. I'll look into this.
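
To illustrate the alignment concern, here is a toy sketch. It is purely
hypothetical and is not how the eval code aligns attachments; the
function names and digests are made up. It shows how pairing
attachments by position can inflate the "missing" counts on both sides
when only one side actually lacks an attachment, whereas pairing by
digest reports a one-sided difference.

```python
from itertools import zip_longest

# Hypothetical attachment digests extracted from the same container file.
attachments_a = ["x", "y", "z"]   # run _a found three attachments
attachments_b = ["x", "z"]        # run _b missed "y"

def missing_by_position(a, b):
    """Pair attachments by index and count unmatched items on each side."""
    missing_in_a = missing_in_b = 0
    for da, db in zip_longest(a, b):
        if da == db:
            continue
        if da is not None:
            missing_in_b += 1   # a's item has no equal partner in b
        if db is not None:
            missing_in_a += 1   # b's item has no equal partner in a
    return missing_in_a, missing_in_b

def missing_by_digest(a, b):
    """Pair attachments by digest and count items absent from each side."""
    set_a, set_b = set(a), set(b)
    return len(set_b - set_a), len(set_a - set_b)

by_position = missing_by_position(attachments_a, attachments_b)
by_digest = missing_by_digest(attachments_a, attachments_b)
print("by position (missing_in_a, missing_in_b):", by_position)
print("by digest   (missing_in_a, missing_in_b):", by_digest)
```

In this toy case the positional pairing reports something missing on
both sides, while the digest pairing reports only the one attachment
that _b lacks.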

c) I've checked one of the metadata diffs (govdocs1/338/338907.ppt) and
can't reproduce/don't understand the values in the report.
I've put the .json output here: http://162.242.228.174/share/338907_ppt.tgz.
I haven't looked yet, but will.

d) looking at the parse times: there are quite a few .ppt which only take
100-400ms in _a whereas in _b it takes them 3-5 sec.

That _may_ be caused by differences in load on the machine/VM, or other
stuff going on in the JVM. Parse times per file can vary wildly, even
with the same versions on different runs. The key for me is that the
rollup by parse time suggests that, _overall_ for ppt, the time is
nearly identical.
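
A minimal sketch of that point, with made-up numbers (the file names,
mime types and millisecond values below are hypothetical, not taken
from the reports): individual files can swing by seconds in either
direction while the per-type totals stay close.

```python
from collections import defaultdict

# Hypothetical rows: (file, mime type, parse millis in _a, parse millis in _b)
rows = [
    ("1.ppt", "application/vnd.ms-powerpoint", 200, 4000),
    ("2.ppt", "application/vnd.ms-powerpoint", 4100, 300),
    ("3.ppt", "application/vnd.ms-powerpoint", 350, 330),
    ("4.doc", "application/msword", 120, 125),
]

def rollup(rows):
    """Sum parse times per mime type for both runs."""
    totals = defaultdict(lambda: [0, 0])
    for _file, mime, ms_a, ms_b in rows:
        totals[mime][0] += ms_a
        totals[mime][1] += ms_b
    return {mime: tuple(t) for mime, t in totals.items()}

totals = rollup(rows)
for mime, (ms_a, ms_b) in totals.items():
    print(f"{mime}: _a={ms_a} ms, _b={ms_b} ms")
```

Here 1.ppt looks 20x slower in _b and 2.ppt looks much faster, yet the
ppt totals differ by only 20 ms.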


> On 07.02.20 13:05, Tim Allison wrote:
> > Hi All,
> >   I haven't had the chance to look, but will do so later today:
> > http://162.242.228.174/reports/poi_4.1.2_reports.tgz
