https://corpora.tika.apache.org/base/reports/pdfbox-2.0.27-v-2.0.28-rc1.tgz

On Mon, Apr 10, 2023 at 7:02 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
>
> Sounds like one of those expected issues. I guess PDFBox now swallows the
> former exception and is able to process the pdf in question. At least the
> exception is gone, maybe there is some more content or just an empty page.
>
> However, IMHO that isn't a regression, but a (small) improvement.
>
> @Tim Thanks for running the tests
>
> Andreas
>
>
> On 10.04.23 at 12:54, Tim Allison wrote:
> > We're getting a build failure on this test now.  I'll turn it off for
> > the build and run the full process.
> >
> > https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java#L1031
> >
> > org.opentest4j.AssertionFailedError: Should have thrown exception ==> expected: <true> but was: <false>
> >      at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> >      at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
> >      at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
> >      at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
> >      at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:211)
> >      at org.apache.tika.parser.pdf.PDFParserTest.testSkipBadPage(PDFParserTest.java:1044)
> >
> > On Mon, Apr 10, 2023 at 6:41 AM Tim Allison <talli...@apache.org> wrote:
> >>
> >> Y. Will start process now. Thank you!
> >>
> >> On Mon, Apr 10, 2023 at 6:20 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I've finished the release process and provided a release candidate for
> >>> 2.0.28.
> >>>
> >>> @Tim Is there any chance to re-run the tests in the next 3 days, so that
> >>> we could stop the release if there is any major regression?
> >>>
> >>> I don't expect any new issues, as the last changes should produce fewer
> >>> exceptions than before, but you never know ...
> >>>
> >>> Thanks in advance
> >>>
> >>> Andreas
> >>>
> >>>
> >>> On 10.04.23 at 11:42, Andreas Lehmkuehler wrote:
> >>>>
> >>>> On 10.04.23 at 04:32, Tilman Hausherr wrote:
> >>>>> On 09.04.2023 22:36, Andreas Lehmkuehler wrote:
> >>>>>> OK, so there is one more question left: do we need to re-run the tests
> >>>>>> before starting the release process?
> >>>>>
> >>>>> Yes I prefer to have another comparison, but it can be done in parallel.
> >>>> Good idea, I'm going to cut the release ...
> >>>>
> >>>> Andreas
> >>>>
> >>>>>
> >>>>> Tilman
> >>>>>
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Andreas
> >>>>>>
> >>>>>> On 09.04.23 at 20:56, Tilman Hausherr wrote:
> >>>>>>> On 09.04.2023 17:35, Andreas Lehmkuehler wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I've fixed the issue with 2 of the 3 pdfs.
> >>>>>>>>
> >>>>>>>> GHOSTSCRIPT-702891-0.pdf is left as the only problematic pdf. I
> >>>>>>>> didn't find a solution which fixes the regressions and still fixes
> >>>>>>>> the original issue from PDFBOX-5178. The parser from the trunk is
> >>>>>>>> able to handle that pdf well.
> >>>>>>>>
> >>>>>>>> IMHO we should leave it alone, as it is malformed and doesn't
> >>>>>>>> contain any useful content. More importantly, it is one pdf out of
> >>>>>>>> hundreds of thousands, just a corner case.
> >>>>>>>>
> >>>>>>>> WDYT?
> >>>>>>>
> >>>>>>> I agree!
> >>>>>>>
> >>>>>>> Tilman
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Andreas
> >>>>>>>>
> >>>>>>>> On 05.04.23 at 08:10, Andreas Lehmkuehler wrote:
> >>>>>>>>> On 04.04.23 at 07:40, Andreas Lehmkuehler wrote:
> >>>>>>>>>> On 03.04.23 at 19:50, Tim Allison wrote:
> >>>>>>>>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.27-v-2.0.28-20230403-reports.tgz
> >>>>>>>>>>>
> >>>>>>>>>>> Haven't had a chance to take a look yet. :(
> >>>>>>>>>> Thanks Tim!
> >>>>>>>>>>
> >>>>>>>>>> There are still 5 new exceptions listed. All of them are related
> >>>>>>>>>> to the very same change coming from PDFBOX-5178, which I fixed the
> >>>>>>>>>> other day. But these cases are different and the trunk is affected
> >>>>>>>>>> as well. My bad for not having a deeper look in the first place.
> >>>>>>>>>>
> >>>>>>>>>> I'm going to investigate those issues
> >>>>>>>>> All pdfs are more or less broken. Two of them are totally useless
> >>>>>>>>> and the new exception just replaces another one. The other three
> >>>>>>>>> contain some more or less readable content and we are hitting the
> >>>>>>>>> well-known dilemma: should we stop reading once we hit something
> >>>>>>>>> bad, or should we try to read as much as possible and maybe run
> >>>>>>>>> into much bigger issues than before?
> >>>>>>>>>
> >>>>>>>>> I guess these are all special corner cases. I'm still thinking
> >>>>>>>>> about a solution to support both strategies.
> >>>>>>>>>
> >>>>>>>>> Andreas
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Andreas
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Apr 3, 2023 at 6:53 AM Tilman Hausherr <thaush...@t-online.de> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Don't wait please
> >>>>>>>>>>>> Thanks
> >>>>>>>>>>>> Tilman
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --- Original message ---
> >>>>>>>>>>>> From: Tim Allison
> >>>>>>>>>>>> Subject: Re: Fwd: 2.0.28 release?
> >>>>>>>>>>>> Date: 3 April 2023, 12:47
> >>>>>>>>>>>> To: dev@pdfbox.apache.org
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Y. I can kick that off now. Or should I wait?
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Sat, Apr 1, 2023 at 2:06 PM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> @Tim
> >>>>>>>>>>>>> Is there any chance to re-run the tests?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Andreas
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 01.04.23 at 17:08, Andreas Lehmkuehler wrote:
> >>>>>>>>>>>>>> On 01.04.23 at 17:05, Andreas Lehmkuehler wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I've accidentally sent this to Tim only :-|
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -------- Forwarded message --------
> >>>>>>>>>>>>>>> Subject: Re: 2.0.28 release?
> >>>>>>>>>>>>>>> Date: Fri, 31 Mar 2023 07:50:10 +0200
> >>>>>>>>>>>>>>> From: Andreas Lehmkuehler <andr...@lehmi.de>
> >>>>>>>>>>>>>>> To: Tim Allison <talli...@apache.org>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 30.03.23 at 16:27, Tim Allison wrote:
> >>>>>>>>>>>>>>>> Reports are here:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> <https://corpora.tika.apache.org/base/reports/pdfbox-2.0.27-v-2.0.28-SNAPSHOT.tgz>
> >>>>>>>>>>>>>>> Thanks Tim.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Looks like we have a regression. There is a handful of new
> >>>>>>>>>>>>>>> exceptions. Some of them just replace another exception and it
> >>>>>>>>>>>>>>> is unclear if the result is better or worse. But at least one
> >>>>>>>>>>>>>>> of the pdfs works in 2.0.27 and doesn't in 2.0.28:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> bug_trackers/PDFBOX/PDFBOX-4424-1.pdf
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'll have a look
> >>>>>>>>>>>>>> The regression was related to PDFBOX-5178. I've fixed it so
> >>>>>>>>>>>>>> that the exceptions should be gone.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Andreas
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Andreas
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, Mar 28, 2023 at 10:42 PM Tilman Hausherr <thaush...@t-online.de> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Yes please!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Tilman
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 28.03.2023 19:22, Tim Allison wrote:
> >>>>>>>>>>>>>>>>>> +1
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Should I run the regression tests now or is there anything
> >>>>>>>>>>>>>>>>>> else text-related that is still being worked on?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Tue, Mar 28, 2023 at 1:05 PM Tilman Hausherr <thaush...@t-online.de> wrote:
> >>>>>>>>>>>>>>>>>>> +1
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Tilman
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On 28.03.2023 08:46, Andreas Lehmkuehler wrote:
> >>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> How about cutting a 2.0.28 release next week on Monday?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> There is a bunch of solved tickets and the last release dates
> >>>>>>>>>>>>>>>>>>>> back 6 months.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Andreas
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> >>>>>>>>>> For additional commands, e-mail: dev-h...@pdfbox.apache.org
> >>>>>>>>>>

