@Tilman, thanks for fixing this.
Should we run another test before cutting the release?
Andreas
On 03.06.23 at 05:53, Tilman Hausherr wrote:
Thank you. This is related to PDFBOX-5606. parseNextToken() is closing
the content stream if an error occurs, but it sometimes calls itself.
Because of the closed content stream the method returns null, which is
reported with the position. Trying to get the position on a closed
stream throws the exception.
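
The failure mode described above can be reproduced in isolation. The following is only a minimal sketch, not PDFBox code: the class ToyStreamParser, its '!' error marker, and the guard flag are all hypothetical names made up for illustration. It assumes a parser that closes its stream on an error and returns null, and that sometimes calls itself. The only real API used is java.io.PushbackInputStream, whose read() throws an IOException with the message "Stream closed" after close().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical toy parser, NOT the PDFBox implementation.
class ToyStreamParser {
    private final PushbackInputStream in;
    private boolean closed; // set when the parser closes its own stream

    ToyStreamParser(String data) {
        in = new PushbackInputStream(
                new ByteArrayInputStream(data.getBytes(StandardCharsets.US_ASCII)));
    }

    // Returns the next token, or null at end of input or after an error.
    String parseNextToken(boolean guard) throws IOException {
        int c = in.read();
        if (c == -1) {
            return null;
        }
        if (c == '!') {
            // Assumed error path: close the stream and signal the error with null.
            in.close();
            closed = true;
            return null;
        }
        if (c == '(') {
            // The method calls itself for the nested token.
            String inner = parseNextToken(guard);
            if (guard && closed) {
                // Guarded variant: stop before touching the closed stream.
                return null;
            }
            // Unguarded variant: this read fails if the nested call closed the stream.
            in.read(); // consume the closing ')'
            return "(" + inner + ")";
        }
        return String.valueOf((char) c);
    }

    // Runs one parse and reports either the token or the exception message.
    static String run(String data, boolean guard) {
        try {
            return "token=" + new ToyStreamParser(data).parseNextToken(guard);
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run("(!", false)); // nested error, outer call reads a closed stream
        System.out.println(run("(!", true));  // nested error, outer call checks the flag first
        System.out.println(run("(a)", false)); // well-formed input
    }
}
```

In this sketch the unguarded call on "(!" ends in the same "Stream closed" IOException as in the report, while the guarded call returns null. Whether the actual fix uses a flag like this is not stated in this thread.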
Tilman
On 02.06.2023 17:08, Tim Allison wrote:
Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.29-pre-rc1-reports.tgz
There is one new exception, which is reproducible with the plain PDFBox app's
ExtractText:
https://corpora.tika.apache.org/base/docs/govdocs1/819/819127.pdf
Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:212)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:199)
at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:518)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:489)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:256)
Caused by: java.io.IOException: Stream closed
at java.base/java.io.PushbackInputStream.ensureOpen(PushbackInputStream.java:75)
at java.base/java.io.PushbackInputStream.read(PushbackInputStream.java:132)
at org.apache.pdfbox.pdfparser.InputStreamSource.read(InputStreamSource.java:47)
at org.apache.pdfbox.pdfparser.BaseParser.skipSpaces(BaseParser.java:1257)
at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:138)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:548)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:137)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1370)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
On Wed, May 31, 2023 at 1:41 PM Tilman Hausherr <thaush...@t-online.de> wrote:
Yes please
Thanks
Tilman
On 31.05.2023 17:15, Tim Allison wrote:
+1
Let me know when/if I should run the text extraction regression tests.
On Thu, May 25, 2023 at 12:32 PM sahy...@fileaffairs.de <sahy...@fileaffairs.de> wrote:
+1
Maruan
On Wednesday, 24.05.2023 at 07:48 +0200, Andreas Lehmkuehler wrote:
Hi,
I'm inclined to release 2.0.29 soon because of the regression that was fixed
with PDFBOX-5606.
WDYT?
Andreas
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org