Reports are here: https://corpora.tika.apache.org/base/reports/pdfbox-2.0.29-pre-rc1-reports.tgz
There is one new exception, which is reproducible with the pure PDFBox app's ExtractText.

https://corpora.tika.apache.org/base/docs/govdocs1/819/819127.pdf

Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:212)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:199)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
    at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:518)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:489)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:256)
Caused by: java.io.IOException: Stream closed
    at java.base/java.io.PushbackInputStream.ensureOpen(PushbackInputStream.java:75)
    at java.base/java.io.PushbackInputStream.read(PushbackInputStream.java:132)
    at org.apache.pdfbox.pdfparser.InputStreamSource.read(InputStreamSource.java:47)
    at org.apache.pdfbox.pdfparser.BaseParser.skipSpaces(BaseParser.java:1257)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:138)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:548)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:137)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1370)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)

On Wed, May 31, 2023 at 1:41 PM Tilman Hausherr <thaush...@t-online.de> wrote:

> Yes please
>
> Thanks
>
> Tilman
>
> On 31.05.2023 17:15, Tim Allison wrote:
> > +1
> >
> > Let me know when/if I should run the text extraction regression tests.
> >
> > On Thu, May 25, 2023 at 12:32 PM sahy...@fileaffairs.de <sahy...@fileaffairs.de> wrote:
> >
> >> +1
> >>
> >> Maruan
> >>
> >> On Wednesday, 24.05.2023 at 07:48 +0200, Andreas Lehmkuehler wrote:
> >>> Hi,
> >>>
> >>> I tend to release 2.0.29 soon due to the regression which was solved with
> >>> PDFBOX-5606.
> >>>
> >>> WDYT?
> >>>
> >>> Andreas
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> >>> For additional commands, e-mail: dev-h...@pdfbox.apache.org