Re: 1.28.2 regression results

2022-04-28 Thread Tim Allison
Tilman,
  Thank you for looking carefully at the reports!

> commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH
1Sonig is what we're getting in 2.3.0 and in the
2.4.0-soon-to-be-candidate, and it looks correct based on the
underlying xml and when I open it in LibreOffice.  It looks like it
was incorrectly put in a different cell or at least incorrectly
separated by a tab in 1.28.1.

> "file not fully read from stream"
This is a new exception in branch_1x because we made the ICNS parser
more strict than it was
(https://github.com/apache/tika/commit/ab709a5299be867c0e603116491faaa6546ed889#diff-6a7cb1f54ca026509b1eed5dabc7556d7e67fdfc2e68737d82f7e10f2550069a).
Note that the files are ~1MB, which means they are likely
CommonCrawlTruncated(TM).  I confirmed that they are truncated.  This
exception is the behavior in the 2.x branch.
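
For anyone who wants to check a suspect file by hand, here is a minimal sketch of a truncation check. It is not Tika's ICNS parser; the class and method names (IcnsTruncationCheck, isTruncated) are made up for illustration. It relies only on the ICNS header layout: the 4-byte magic "icns" followed by a 4-byte big-endian total file length. If the stream ends before the declared length is reached, the file is treated as truncated.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
public class IcnsTruncationCheck {

    /**
     * Returns true if the stream ends before the total length declared
     * in the ICNS header. Throws IOException if the magic is not "icns".
     */
    public static boolean isTruncated(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        try {
            byte[] magic = new byte[4];
            data.readFully(magic);
            if (!"icns".equals(new String(magic, StandardCharsets.US_ASCII))) {
                throw new IOException("not an ICNS file");
            }
            // readInt() is big-endian; mask to treat the value as unsigned.
            long declared = data.readInt() & 0xFFFFFFFFL;
            long remaining = declared - 8; // the 8 header bytes are already consumed
            byte[] buf = new byte[8192];
            while (remaining > 0) {
                int n = data.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) {
                    return true; // stream ended before the declared length
                }
                remaining -= n;
            }
            return false;
        } catch (EOFException e) {
            return true; // even the 8-byte header was cut short
        }
    }

    public static void main(String[] args) throws IOException {
        // Header declares 16 bytes, but only 12 are present.
        byte[] truncated = {'i', 'c', 'n', 's', 0, 0, 0, 16, 1, 2, 3, 4};
        // Header declares 12 bytes, and 12 are present.
        byte[] complete = {'i', 'c', 'n', 's', 0, 0, 0, 12, 1, 2, 3, 4};
        System.out.println(isTruncated(new ByteArrayInputStream(truncated)));
        System.out.println(isTruncated(new ByteArrayInputStream(complete)));
    }
}
```

This only compares the declared length against the bytes actually available; it does not validate the individual icon elements the way a real parser would.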



On Thu, Apr 28, 2022 at 2:31 AM Tilman Hausherr wrote:
>
> On 28.04.2022 at 00:25, Tim Allison wrote:
> > Are available here:
> > https://corpora.tika.apache.org/base/reports/tika-1.28.2-reports-20220427.tgz
> >
> > I haven't taken a look yet.
> >
> > Let me know if you find anything.
>
>
> commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH
>
> this is minor and is related to superscript, I don't know if this is
> wanted or not.
>
> The two "file not fully read from stream" exceptions, am I correct to
> assume that these are problems in the batch itself?
>
> Tilman
>


Re: 1.28.2 regression results

2022-04-27 Thread Tilman Hausherr

On 28.04.2022 at 00:25, Tim Allison wrote:
> Are available here:
> https://corpora.tika.apache.org/base/reports/tika-1.28.2-reports-20220427.tgz
>
> I haven't taken a look yet.
>
> Let me know if you find anything.



commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH

this is minor and is related to superscript, I don't know if this is 
wanted or not.


The two "file not fully read from stream" exceptions, am I correct to 
assume that these are problems in the batch itself?


Tilman



1.28.2 regression results

2022-04-27 Thread Tim Allison
Are available here:
https://corpora.tika.apache.org/base/reports/tika-1.28.2-reports-20220427.tgz

I haven't taken a look yet.

Let me know if you find anything.

Best,

  Tim


Re: 1.28.2 regression results

2022-04-26 Thread Tilman Hausherr

On 26.04.2022 at 21:45, Tim Allison wrote:
> I should clarify that I fixed the two regressions that I had
> identified in the release candidate.  The regression results that I
> shared were run with 1.x before those fixes.

Ah ok, but then the tests should be run again after the fixes in case
something got broken by the fix (it happened in the pdfbox project).  If
nothing got broken, then there's still the satisfaction of having very
small result files :-)

Also suspicious:

bug_trackers/TIKA/TIKA-2215-0.ppt

Tilman

> Still, let's fix the dependency convergence, and please let me know if
> there's anything else you find in the regression reports!
>
> On Tue, Apr 26, 2022 at 3:40 PM Tim Allison wrote:
> >
> > Hi Tilman,
> >
> >   Thank you for raising this. 3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4 is not
> > related to TIKA-3734.  The updated junrar (7.5.0) is swallowing a
> > (new) exception on this file and stopping the parse without throwing
> > an exception.  The earlier version of junrar (7.4.1) did not find a
> > problem with the file.
> >
> >   My ubuntu package util throws an exception on this file, and I think
> > it is just kind of wonky.
> >
> >   I'm going to fix the dependency convergence issues.  Is there anything else?
> >
> >   Best,
> >
> >  Tim
> >
> > On Tue, Apr 26, 2022 at 2:52 PM Tilman Hausherr wrote:
> > >
> > > On 26.04.2022 at 13:07, Tim Allison wrote:
> > > > Reports are here:
> > > > https://corpora.tika.apache.org/base/reports/reports-tika-1.28.2-SNAPSHOT.tgz
> > > >
> > > > I found two issues that should be fixed (TIKA-3733 and TIKA-3734).  I
> > > > think both are related to the underlying parsers being stricter (which
> > > > is good), but we need to change our code to handle these cases more
> > > > robustly.
> > > >
> > > > Let me know if you see anything else.
> > >
> > > What about commoncrawl3/3X/3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4 , this is
> > > also a rar file and the last entry in content_diffs_no_exceptions.xlsx .
> > > Is that related to TIKA-3734 ?
> > >
> > > Tilman





Re: 1.28.2 regression results

2022-04-26 Thread Tim Allison
I should clarify that I fixed the two regressions that I had
identified in the release candidate.  The regression results that I
shared were run with 1.x before those fixes.

Still, let's fix the dependency convergence, and please let me know if
there's anything else you find in the regression reports!

On Tue, Apr 26, 2022 at 3:40 PM Tim Allison wrote:
>
> Hi Tilman,
>
>   Thank you for raising this. 3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4 is not
> related to TIKA-3734.  The updated junrar (7.5.0) is swallowing a
> (new) exception on this file and stopping the parse without throwing
> an exception.  The earlier version of junrar (7.4.1) did not find a
> problem with the file.
>
>   My ubuntu package util throws an exception on this file, and I think
> it is just kind of wonky.
>
>   I'm going to fix the dependency convergence issues.  Is there anything else?
>
>   Best,
>
>  Tim
>
> On Tue, Apr 26, 2022 at 2:52 PM Tilman Hausherr wrote:
> >
> > On 26.04.2022 at 13:07, Tim Allison wrote:
> > > Reports are here:
> > > https://corpora.tika.apache.org/base/reports/reports-tika-1.28.2-SNAPSHOT.tgz
> > >
> > > I found two issues that should be fixed (TIKA-3733 and TIKA-3734).  I
> > > think both are related to the underlying parsers being stricter (which
> > > is good), but we need to change our code to handle these cases more
> > > robustly.
> > >
> > > Let me know if you see anything else.
> >
> > What about commoncrawl3/3X/3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4 , this is
> > also a rar file and the last entry in content_diffs_no_exceptions.xlsx .
> > Is that related to TIKA-3734 ?
> >
> > Tilman
> >


Re: 1.28.2 regression results

2022-04-26 Thread Tim Allison
Hi Tilman,

  Thank you for raising this. 3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4 is not
related to TIKA-3734.  The updated junrar (7.5.0) is swallowing a
(new) exception on this file and stopping the parse without throwing
an exception.  The earlier version of junrar (7.4.1) did not find a
problem with the file.

  My ubuntu package util throws an exception on this file, and I think
it is just kind of wonky.

  I'm going to fix the dependency convergence issues.  Is there anything else?

  Best,

 Tim

On Tue, Apr 26, 2022 at 2:52 PM Tilman Hausherr wrote:
>
> On 26.04.2022 at 13:07, Tim Allison wrote:
> > Reports are here:
> > https://corpora.tika.apache.org/base/reports/reports-tika-1.28.2-SNAPSHOT.tgz
> >
> > I found two issues that should be fixed (TIKA-3733 and TIKA-3734).  I
> > think both are related to the underlying parsers being stricter (which
> > is good), but we need to change our code to handle these cases more
> > robustly.
> >
> > Let me know if you see anything else.
>
> What about commoncrawl3/3X/3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4 , this is
> also a rar file and the last entry in content_diffs_no_exceptions.xlsx .
> Is that related to TIKA-3734 ?
>
> Tilman
>


Re: 1.28.2 regression results

2022-04-26 Thread Tilman Hausherr

On 26.04.2022 at 13:07, Tim Allison wrote:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/reports-tika-1.28.2-SNAPSHOT.tgz
>
> I found two issues that should be fixed (TIKA-3733 and TIKA-3734).  I
> think both are related to the underlying parsers being stricter (which
> is good), but we need to change our code to handle these cases more
> robustly.
>
> Let me know if you see anything else.


What about commoncrawl3/3X/3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4 , this is 
also a rar file and the last entry in content_diffs_no_exceptions.xlsx . 
Is that related to TIKA-3734 ?


Tilman



Re: 1.28.2 regression results

2022-04-26 Thread Tilman Hausherr




> Let me know if you see anything else.


The JDK 11 and 17 builds fail because of a dependency convergence error.
I don't know if this is really relevant, i.e. would the JDK 8 build still
be ok for people using Tika on JDK 11 and 17?


Tilman



1.28.2 regression results

2022-04-26 Thread Tim Allison
Reports are here:
https://corpora.tika.apache.org/base/reports/reports-tika-1.28.2-SNAPSHOT.tgz

I found two issues that should be fixed (TIKA-3733 and TIKA-3734).  I
think both are related to the underlying parsers being stricter (which
is good), but we need to change our code to handle these cases more
robustly.

Let me know if you see anything else.