[
https://issues.apache.org/jira/browse/PDFBOX-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013852#comment-17013852
]
Tilman Hausherr commented on PDFBOX-4739:
-----------------------------------------
This is surprising: I was able to render the file with -Xmx80m on JDK 13 and
-Xmx200m on JDK 8 when using PDFToImage at 300 dpi with TIFF output.
When running PDFDebugger under a profiler, 70 MB are already in use before the
file is opened. This could be the fonts and the colors.
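
To put these heap sizes in perspective, the raster for a single page can be estimated from the page size and the DPI. The following is only a rough sketch, not PDFBox code: the class and method names are hypothetical, the US Letter page size (612 x 792 points) is an assumption because the dimensions of the attached file are not stated here, and 4 bytes per pixel is assumed, as for a BufferedImage of TYPE_INT_RGB.

```java
/** Hypothetical helper: rough size of the pixel buffer for one rendered page. */
final class RasterMemoryEstimate {
    /** PDF user space uses 72 points per inch by default. */
    static final double POINTS_PER_INCH = 72.0;

    /**
     * Estimates the bytes needed for the raster of one page.
     * bytesPerPixel = 4 is an assumption that matches an int-packed RGB image.
     */
    static long estimateBytes(double widthPt, double heightPt, int dpi, int bytesPerPixel) {
        // Convert points to pixels at the requested resolution, at least 1 pixel each way.
        long widthPx = (long) Math.max(Math.floor(widthPt * dpi / POINTS_PER_INCH), 1);
        long heightPx = (long) Math.max(Math.floor(heightPt * dpi / POINTS_PER_INCH), 1);
        return widthPx * heightPx * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 612 x 792 points.
        long bytes = estimateBytes(612, 792, 300, 4);
        System.out.printf("Estimated raster size at 300 dpi: %.1f million bytes%n", bytes / 1e6);
    }
}
```

Under these assumptions the raster is 2550 x 3300 pixels, or 33,660,000 bytes, and it grows with the square of the DPI. This counts only the pixel buffer, not fonts, color spaces, or the encoded TIFF output.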
> Memory issues when rendering pdf to image
> -----------------------------------------
>
> Key: PDFBOX-4739
> URL: https://issues.apache.org/jira/browse/PDFBOX-4739
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.18
> Reporter: Lior Yaffe
> Priority: Blocker
> Attachments: linkedinceoresume.pdf
>
>
> So I'm trying to write a web service that performs OCR on input PDF files.
> The code is very simple: convert the PDF to TIFF files using PDFBox, and
> then run Tesseract on the TIFF files to get the text.
> The code is very straightforward:
>
> {code:java}
> private List<ByteArrayOutputStream> convertPdfToTiff2() throws IOException {
>     List<ByteArrayOutputStream> fileList = new ArrayList<>();
>     PDDocument doc = PDDocument.load(this.bytes);
>     doc.setResourceCache(new EmptyCache());
>     try {
>         PDFRenderer pdfRenderer = new PDFRenderer(doc);
>         for (int page = 0; page < doc.getNumberOfPages(); ++page) {
>             BufferedImage bufferedImage =
>                     pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
>             calcImageSize(bufferedImage);
>             ByteArrayOutputStream os = new ByteArrayOutputStream();
>             ImageIO.write(bufferedImage, "tiff", os);
>             os.flush();
>             os.close();
>             bufferedImage.flush();
>             bufferedImage = null;
>             fileList.add(os);
>         }
>     } finally {
>         doc.close();
>     }
>     return fileList;
> }
> {code}
>
> I'm trying to run a sample test that runs this concurrently on 5-6 different
> threads, but the app crashes very quickly.
>
> I did some memory tests, and it seems that while the input file is around 70
> KB, the
> {code:java}
> pdfRenderer
> {code}
> object is around 300 MB! No matter how I change the DPI level, the object
> is still very large.
> In addition, I only see the memory drop when I call the GC explicitly, even
> though I'm closing the doc object.
>
> Basically, when I run my server with -Xmx6GB and 6 concurrent threads,
> the service crashes after 3 runs. What am I missing here?
>
> * I attached the input PDF file
>
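
Since the report above describes 5-6 threads rendering at the same time, one way to keep peak heap use predictable is to cap how many renders run concurrently. The sketch below is not part of PDFBox and is not a fix confirmed in this issue; BoundedRenderer and its methods are hypothetical names, and the render step is passed in as a Callable so the example uses only the JDK.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

/** Hypothetical wrapper that lets at most N render tasks run at the same time. */
final class BoundedRenderer {
    private final Semaphore permits;

    BoundedRenderer(int maxConcurrentRenders) {
        this.permits = new Semaphore(maxConcurrentRenders);
    }

    /** Blocks until a permit is free, runs the task, and always returns the permit. */
    <T> T render(Callable<T> renderTask) throws Exception {
        permits.acquire();
        try {
            return renderTask.call();
        } finally {
            // Release even if rendering throws, so a failed page does not leak a permit.
            permits.release();
        }
    }

    /** Number of permits currently free (useful for monitoring). */
    int availablePermits() {
        return permits.availablePermits();
    }
}
```

A caller could wrap the per-page work, for example `renderer.render(() -> pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB))`, so that only the permitted number of page rasters exists at once regardless of how many request threads are active.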