Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.29-pre-rc1-reports.tgz

There is one new exception, which is reproducible with the plain PDFBox app's ExtractText:

https://corpora.tika.apache.org/base/docs/govdocs1/819/819127.pdf

Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:212)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:199)
at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:518)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:489)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:256)
Caused by: java.io.IOException: Stream closed
at java.base/java.io.PushbackInputStream.ensureOpen(PushbackInputStream.java:75)
at java.base/java.io.PushbackInputStream.read(PushbackInputStream.java:132)
at org.apache.pdfbox.pdfparser.InputStreamSource.read(InputStreamSource.java:47)
at org.apache.pdfbox.pdfparser.BaseParser.skipSpaces(BaseParser.java:1257)
at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:138)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:548)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:137)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1370)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)

On Wed, May 31, 2023 at 1:41 PM Tilman Hausherr <thaush...@t-online.de>
wrote:

> Yes please
>
> Thanks
>
> Tilman
>
> On 31.05.2023 17:15, Tim Allison wrote:
> > +1
> >
> > Let me know when/if I should run the text extraction regression tests.
> >
> > On Thu, May 25, 2023 at 12:32 PM sahy...@fileaffairs.de <
> > sahy...@fileaffairs.de> wrote:
> >
> >> +1
> >>
> >> Maruan
> >>
> >> On Wednesday, 24.05.2023 at 07:48 +0200, Andreas Lehmkuehler wrote:
> >>> Hi,
> >>>
> >>> I am inclined to release 2.0.29 soon because of the regression that was
> >>> fixed in PDFBOX-5606.
> >>>
> >>> WDYT?
> >>>
> >>> Andreas
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> >>> For additional commands, e-mail: dev-h...@pdfbox.apache.org
> >>>