@Tilman, thanks for fixing this.

Should we run another test before cutting the release?

Andreas

On 03.06.23 at 05:53, Tilman Hausherr wrote:
Thank you. This is related to PDFBOX-5606. parseNextToken() closes the content stream when an error occurs, but it sometimes calls itself. Because the content stream is closed by then, the method returns null, which is reported together with the position. Trying to get the position from a closed stream throws the exception.
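
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not PDFBox code and not the actual fix: ToyParser, its guard flag, and the '!' error marker are hypothetical names invented for illustration. It only shows the general pattern described above, where a method closes its stream on error, calls itself again, and then touches the closed stream, and it assumes one possible guard (remembering that the stream was closed).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class ClosedStreamSketch {

    // Hypothetical toy parser; it only mimics the close-then-recurse pattern.
    static class ToyParser {
        private final PushbackInputStream in;
        private final boolean guard;
        private boolean closed = false;

        ToyParser(byte[] data, boolean guard) {
            this.in = new PushbackInputStream(new ByteArrayInputStream(data));
            this.guard = guard;
        }

        // Returns the next one-character token, or null when nothing is left.
        // The byte '!' stands in for malformed content that triggers an error.
        String parseNextToken() throws IOException {
            if (guard && closed) {
                return null; // guarded variant: never touch a closed stream
            }
            int c = in.read(); // throws IOException("Stream closed") once closed
            if (c == -1) {
                return null;
            }
            if (c == '!') {
                in.close();   // the error path closes the stream ...
                closed = true;
                return parseNextToken(); // ... and then the method calls itself
            }
            return String.valueOf((char) c);
        }
    }

    // Collects tokens, then appends either "|end" or the exception message.
    static String run(boolean guard) {
        ToyParser parser = new ToyParser("a!b".getBytes(StandardCharsets.US_ASCII), guard);
        StringBuilder out = new StringBuilder();
        try {
            String token;
            while ((token = parser.parseNextToken()) != null) {
                out.append(token);
            }
            out.append("|end");
        } catch (IOException e) {
            out.append("|").append(e.getMessage());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // unguarded: a|Stream closed
        System.out.println(run(true));  // guarded:   a|end
    }
}
```

In the unguarded run the self-call reads from the already closed PushbackInputStream, which produces the same "Stream closed" IOException seen in the report; the guarded run ends parsing quietly instead.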

Tilman

On 02.06.2023 17:08, Tim Allison wrote:
Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.29-pre-rc1-reports.tgz

There is one new exception, which is reproducible with the plain PDFBox app's ExtractText.

https://corpora.tika.apache.org/base/docs/govdocs1/819/819127.pdf

Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:212)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:199)
at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:518)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:489)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:256)
Caused by: java.io.IOException: Stream closed
at java.base/java.io.PushbackInputStream.ensureOpen(PushbackInputStream.java:75)
at java.base/java.io.PushbackInputStream.read(PushbackInputStream.java:132)
at org.apache.pdfbox.pdfparser.InputStreamSource.read(InputStreamSource.java:47)
at org.apache.pdfbox.pdfparser.BaseParser.skipSpaces(BaseParser.java:1257)
at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:138)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:548)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:137)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1370)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)

On Wed, May 31, 2023 at 1:41 PM Tilman Hausherr <thaush...@t-online.de> wrote:

Yes please

Thanks

Tilman

On 31.05.2023 17:15, Tim Allison wrote:
+1

Let me know when/if I should run the text extraction regression tests.

On Thu, May 25, 2023 at 12:32 PM sahy...@fileaffairs.de <sahy...@fileaffairs.de> wrote:

+1

Maruan

On Wednesday, 24.05.2023 at 07:48 +0200, Andreas Lehmkuehler wrote:
Hi,

I am inclined to release 2.0.29 soon because of the regression that was
fixed with PDFBOX-5606.

WDYT?

Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org