Hi Tilman,

Thank you for raising this. 3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4 is not related to TIKA-3734. The updated junrar (7.5.0) is swallowing a (new) exception on this file and stopping the parse without throwing an exception. The earlier version of junrar (7.4.1) did not find a problem with the file.
My Ubuntu package util throws an exception on this file, so I think the file itself is just kind of wonky.

I'm going to fix the dependency convergence issues. Is there anything else?

Best,
Tim

On Tue, Apr 26, 2022 at 2:52 PM Tilman Hausherr <thaush...@t-online.de> wrote:
>
> On 26.04.2022 at 13:07, Tim Allison wrote:
> > Reports are here:
> > https://corpora.tika.apache.org/base/reports/reports-tika-1.28.2-SNAPSHOT.tgz
> >
> > I found two issues that should be fixed (TIKA-3733 and TIKA-3734). I
> > think both are related to the underlying parsers being stricter (which
> > is good), but we need to change our code to handle these cases more
> > robustly.
> >
> > Let me know if you see anything else.
>
> What about commoncrawl3/3X/3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4? This is
> also a rar file and the last entry in content_diffs_no_exceptions.xlsx.
> Is that related to TIKA-3734?
>
> Tilman