Tilman,
  Thank you for looking carefully at the reports!

> commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH
1Sonig is what we're getting in 2.3.0 and in the
2.4.0-soon-to-be-candidate, and it looks correct based on the
underlying xml and when I open it in LibreOffice.  It looks like it
was incorrectly put in a different cell or at least incorrectly
separated by a tab in 1.28.1.

> "file not fully read from stream"
This is a new exception in branch_1x because we made the ICNS parser
more strict than it was
(https://github.com/apache/tika/commit/ab709a5299be867c0e603116491faaa6546ed889#diff-6a7cb1f54ca026509b1eed5dabc7556d7e67fdfc2e68737d82f7e10f2550069a).
Note that the files are ~1MB, which means they are likely
CommonCrawlTruncated(TM).  I confirmed that they are truncated.  The
2.x branch throws the same exception.
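
For anyone who wants to screen other files in the corpus for the same
problem, here is a rough sketch of a size check.  It assumes that the
crawl capped each payload at 1 MiB (1,048,576 bytes); that cap and the
function names are illustrative assumptions, not part of Tika or the
reports tooling, and a size match is only a hint, not proof of truncation.

```python
import os

# Assumption: payloads were cut off at exactly 1 MiB when crawled.
ASSUMED_CAP_BYTES = 1024 * 1024


def size_suggests_truncation(size_bytes, cap=ASSUMED_CAP_BYTES):
    """Return True when a size equals the assumed cap (a hint, not proof)."""
    return size_bytes == cap


def file_suggests_truncation(path, cap=ASSUMED_CAP_BYTES):
    """Apply the same size heuristic to a file on disk."""
    return size_suggests_truncation(os.path.getsize(path), cap)
```

A file flagged this way still needs to be opened and checked by hand, as
was done for the two files above.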



On Thu, Apr 28, 2022 at 2:31 AM Tilman Hausherr <thaush...@t-online.de> wrote:
>
> On 28.04.2022 at 00:25, Tim Allison wrote:
> > Are available here:
> > https://corpora.tika.apache.org/base/reports/tika-1.28.2-reports-20220427.tgz
> >
> > I haven't taken a look yet.
> >
> > Let me know if you find anything.
>
>
> commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH
>
> this is minor and is related to superscript, I don't know if this is
> wanted or not.
>
> The two "file not fully read from stream" exceptions, am I correct to
> assume that these are problems in the batch itself?
>
> Tilman
>
