[
https://issues.apache.org/jira/browse/PDFBOX-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lior Yaffe updated PDFBOX-4739:
-------------------------------
Description:
I'm trying to write a web service that performs OCR on input PDF files.
The code is very simple: convert the PDF to TIFF images using PDFBox, then
run Tesseract on the TIFF images to extract the text.
The code is straightforward:
{code:java}
private List<ByteArrayOutputStream> convertPdfToTiff2() throws IOException {
    List<ByteArrayOutputStream> fileList = new ArrayList<>();
    PDDocument doc = PDDocument.load(this.bytes);
    doc.setResourceCache(new EmptyCache());
    try {
        PDFRenderer pdfRenderer = new PDFRenderer(doc);
        for (int page = 0; page < doc.getNumberOfPages(); ++page) {
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
            calcImageSize(bufferedImage);
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            ImageIO.write(bufferedImage, "tiff", os);
            os.flush();
            os.close();
            bufferedImage.flush();
            bufferedImage = null;
            fileList.add(os);
        }
    } finally {
        doc.close();
    }
    return fileList;
}
{code}
I'm running a sample test that calls this method concurrently from 5-6
threads, but the app crashes very quickly.
I did some memory tests, and it seems that while the input file is around 70
KB, the {{pdfRenderer}} object is around 300 MB! Even when I change the DPI
level, the object is still very large.
In addition, the memory only drops when I explicitly call the GC, even though
I close the doc object.
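For scale, here is a back-of-the-envelope sketch of how large the raw raster for a single rendered page can get at different DPI values. It assumes 4 bytes per pixel (what an int-packed RGB BufferedImage uses) and a hypothetical US Letter page, since the attached file's page size is not stated here. It does not model anything else the renderer may hold, such as fonts or decoded images, so it is a lower bound rather than a measurement.

```java
// Rough estimate of the raw pixel buffer for one rendered page.
// Assumption: 4 bytes per pixel, as with BufferedImage.TYPE_INT_RGB.
final class RasterEstimate {

    /** Returns the approximate pixel-buffer size in bytes for one page. */
    static long rasterBytes(double widthInches, double heightInches, int dpi, int bytesPerPixel) {
        long widthPx = Math.round(widthInches * dpi);
        long heightPx = Math.round(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Hypothetical page size: US Letter (8.5 x 11 inches).
        for (int dpi : new int[] {72, 150, 300}) {
            long bytes = rasterBytes(8.5, 11.0, dpi, 4);
            System.out.printf("%d DPI: about %.1f MB per page%n", dpi, bytes / 1_000_000.0);
        }
    }
}
```

Under these assumptions a Letter page at 300 DPI is 2550 x 3300 pixels, or roughly 33.7 MB of pixel data, and the size grows with the square of the DPI.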
Basically, when I run my server with -Xmx6GB and 6 concurrent threads, the
service crashes after 3 runs. What am I missing here?
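One way to explore whether peak memory is driven by how many renders run at once is to cap the number of concurrent conversions. The sketch below uses only the JDK. `BoundedRunner` and `peakConcurrency` are hypothetical names introduced for illustration, and the sleeping task is a stand-in for a real conversion call. It is not a confirmed fix for this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: lets at most maxConcurrent tasks run at the same time.
final class BoundedRunner {
    private final Semaphore permits;

    BoundedRunner(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Blocks until a permit is free, runs the task, then releases the permit. */
    <T> T run(Callable<T> task) throws Exception {
        permits.acquire();
        try {
            return task.call();
        } finally {
            permits.release();
        }
    }

    /** Demo: runs dummy tasks on a thread pool and reports the peak number active at once. */
    static int peakConcurrency(int limit, int threads, int tasks) throws Exception {
        BoundedRunner runner = new BoundedRunner(limit);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        Callable<Integer> work = () -> {
            int now = active.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            Thread.sleep(20); // stand-in for converting one document
            return active.decrementAndGet();
        };
        Callable<Integer> job = () -> runner.run(work);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(job));
            }
            for (Future<Integer> f : futures) {
                f.get(); // propagate any task failure
            }
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) throws Exception {
        // 6 worker threads, but no more than 2 tasks inside the guarded section at once.
        System.out.println("peak concurrent tasks: " + peakConcurrency(2, 6, 12));
    }
}
```

In a service like the one described, the conversion method would be passed to `run(...)` in place of the dummy task, with the limit chosen from the available heap and the expected per-page raster size.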
* I attached the input PDF file.
was:
So I'm trying to write a web service which performs OCR on an input pdf files.
The code is very simple - convert the pdf to tiff files using PDFBox, and then
use tesseract on the tiff files to get text.
code is very straight forward:
{code:java}
private List<ByteArrayOutputStream> convertPdfToTiff2() throws IOException {
    List<ByteArrayOutputStream> fileList = new ArrayList<>();
    PDDocument doc = PDDocument.load(this.bytes);
    doc.setResourceCache(new EmptyCache());
    try {
        PDFRenderer pdfRenderer = new PDFRenderer(doc);
        for (int page = 0; page < doc.getNumberOfPages(); ++page) {
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(0, 300, ImageType.RGB);
            calcImageSize(bufferedImage);
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            ImageIO.write(bufferedImage, "tiff", os);
            os.flush();
            os.close();
            bufferedImage.flush();
            bufferedImage = null;
            fileList.add(os);
        }
    } finally {
        doc.close();
    }
    return fileList;
}
{code}
I'm trying to run a sample test which runs this concurrent with 5-6 different
threads, but the app is crashing very fast.
I did some memory tests, and it seems that while the input file is around 70
kb, the
{code:java}
pdfRenderer
{code}
object is around 300 MB!! no matter if i'm changing the DPI level, the object
is still very large.
in addition, only if I'm calling the GC I see the memory drops, even if I'm
closing the doc object....
Basically when I'm running my server with -Xmx6GB with 6 threads in concurrent,
after 3 runs the service is crashing....what am I missing here?
* I attached the input pdf file
> Memory issues when rendering pdf to image
> -----------------------------------------
>
> Key: PDFBOX-4739
> URL: https://issues.apache.org/jira/browse/PDFBOX-4739
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.18
> Reporter: Lior Yaffe
> Priority: Blocker
> Attachments: linkedinceoresume.pdf
>