Tilman,

Thank you for looking carefully at the reports!

> commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH

"1Sonig" is what we're getting in 2.3.0 and in the soon-to-be 2.4.0 release candidate, and it looks correct based on the underlying XML and on how the file renders when I open it in LibreOffice. In 1.28.1, it looks like the text was incorrectly put in a different cell, or at least incorrectly separated by a tab.
> "file not fully read from stream"

This is a new exception in branch_1x because we made the ICNS parser stricter than it was (https://github.com/apache/tika/commit/ab709a5299be867c0e603116491faaa6546ed889#diff-6a7cb1f54ca026509b1eed5dabc7556d7e67fdfc2e68737d82f7e10f2550069a). Note that the files are ~1MB, which means they are likely CommonCrawlTruncated(TM). I confirmed that they are truncated. This exception matches the behavior in the 2.x branch.

On Thu, Apr 28, 2022 at 2:31 AM Tilman Hausherr <thaush...@t-online.de> wrote:
>
> On 28.04.2022 at 00:25, Tim Allison wrote:
> > Are available here:
> > https://corpora.tika.apache.org/base/reports/tika-1.28.2-reports-20220427.tgz
> >
> > I haven't taken a look yet.
> >
> > Let me know if you find anything.
>
> commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH
>
> this is minor and is related to superscript, I don't know if this is
> wanted or not.
>
> The two "file not fully read from stream" exceptions, am I correct to
> assume that these are problems in the batch itself?
>
> Tilman
>
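
For anyone who wants to check the truncation point above on their own files, here is a minimal sketch. It assumes the standard ICNS header layout (a 4-byte "icns" magic followed by a 4-byte big-endian total file length) and simply compares the declared length with the number of bytes actually present. The class and method names are hypothetical, and this is not the code from the linked Tika commit.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not Tika's ICNS parser.
final class IcnsTruncationCheck {

    // Assumed header layout: "icns" magic, then a big-endian total length
    // that counts the 8 header bytes as well.
    private static final byte[] MAGIC = "icns".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the stream ends before the length declared in the header. */
    static boolean isTruncated(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        long declared;
        try {
            byte[] magic = new byte[4];
            data.readFully(magic);
            if (!Arrays.equals(magic, MAGIC)) {
                throw new IOException("not an ICNS file");
            }
            // readInt() is big-endian; mask to treat the value as unsigned.
            declared = data.readInt() & 0xFFFFFFFFL;
        } catch (EOFException e) {
            // Stream ended inside the 8-byte header itself.
            return true;
        }
        long seen = 8; // header bytes already consumed
        byte[] buf = new byte[8192];
        int n;
        while ((n = data.read(buf)) != -1) {
            seen += n;
        }
        return seen < declared;
    }

    public static void main(String[] args) throws IOException {
        // Synthetic 16-byte "file": header plus 8 bytes of dummy payload.
        byte[] full = ByteBuffer.allocate(16)
                .put(MAGIC).putInt(16).put(new byte[8]).array();
        byte[] cut = Arrays.copyOf(full, 12); // simulate a truncated download

        System.out.println(isTruncated(new ByteArrayInputStream(full)));
        System.out.println(isTruncated(new ByteArrayInputStream(cut)));
    }
}
```

A stricter parser would presumably throw on the second case rather than return whatever it managed to read, which is consistent with the new exception showing up on truncated files.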