[
https://issues.apache.org/jira/browse/PDFBOX-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014075#comment-17014075
]
Lior Yaffe commented on PDFBOX-4739:
------------------------------------
This is the code that starts the threads:
{code:java}
for (int i = 0; i < 15; i++) {
    Thread thread = new Thread(new Runnable() {
        @Override
        public void run() {
            String filePath = "linkedinceoresume.pdf";
            File pdfFile = new File(filePath);
            byte[] bytes = new byte[0];
            try {
                bytes = Files.readAllBytes(pdfFile.toPath());
            } catch (IOException e) {
                e.printStackTrace();
            }
            try {
                String text = doOCR();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
    thread.start();
}
{code}
It runs with these JVM options:
{code:java}
-Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider -Xmx4G
{code}
and the output is:
{code:java}
Before....Used memory: 48 MB
Before....Used memory: 48 MB
Before....Used memory: 49 MB
Before....Used memory: 49 MB
Before....Used memory: 49 MB
Before....Used memory: 50 MB
Before....Used memory: 51 MB
Before....Used memory: 51 MB
Before....Used memory: 51 MB
Before....Used memory: 51 MB
Before....Used memory: 51 MB
Before....Used memory: 51 MB
Before....Used memory: 52 MB
Before....Used memory: 53 MB
Before....Used memory: 53 MB
Before....Used memory: 62 MB
After....Used memory: 2883 MB
After....Used memory: 2987 MB
After....Used memory: 2791 MB
After....Used memory: 2849 MB
After....Used memory: 2883 MB
After....Used memory: 2910 MB
After....Used memory: 2982 MB
After....Used memory: 3226 MB
After....Used memory: 3281 MB
After....Used memory: 3306 MB
After....Used memory: 3157 MB
After....Used memory: 3157 MB
After....Used memory: 3159 MB
After....Used memory: 3157 MB
After....Used memory: 3159 MB
Exception in thread "Thread-5" java.lang.OutOfMemoryError: Java heap space
Exception in thread "Thread-8" java.lang.OutOfMemoryError: Java heap space
Exception in thread "Thread-7" java.lang.OutOfMemoryError: Java heap space
Exception in thread "Thread-14" java.lang.OutOfMemoryError: Java heap space
    at java.awt.image.DataBufferByte.<init>(DataBufferByte.java:92)
    at java.awt.image.ComponentSampleModel.createDataBuffer(ComponentSampleModel.java:445)
    at java.awt.image.Raster.createWritableRaster(Raster.java:941)
    at javax.imageio.ImageTypeSpecifier.createBufferedImage(ImageTypeSpecifier.java:1074)
    at javax.imageio.ImageReader.getDestination(ImageReader.java:2892)
    at com.github.jaiimageio.impl.plugins.tiff.TIFFImageReader.read(TIFFImageReader.java:1161)
    at javax.imageio.ImageIO.read(ImageIO.java:1448)
    at net.sourceforge.tess4j.util.ImageIOHelper.getImageByteBuffer(ImageIOHelper.java:301)
    at net.sourceforge.tess4j.Tesseract.setImage(Tesseract.java:394)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:287)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:260)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:241)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:225)
    at com.ziprecruiter.ocr.tesseract.TesseractOCR.getTextFromTiff(TesseractOCR.java:239)
    at com.ziprecruiter.ocr.tesseract.TesseractOCR.doOCR(TesseractOCR.java:82)
    at com.ziprecruiter.MemoryTest$2.run(MemoryTest.java:76)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
    at java.awt.image.DataBufferByte.<init>(DataBufferByte.java:92)
    at java.awt.image.ComponentSampleModel.createDataBuffer(ComponentSampleModel.java:445)
    at java.awt.image.Raster.createWritableRaster(Raster.java:941)
    at javax.imageio.ImageTypeSpecifier.createBufferedImage(ImageTypeSpecifier.java:1074)
    at javax.imageio.ImageReader.getDestination(ImageReader.java:2892)
    at com.github.jaiimageio.impl.plugins.tiff.TIFFImageReader.read(TIFFImageReader.java:1161)
    at javax.imageio.ImageIO.read(ImageIO.java:1448)
    at net.sourceforge.tess4j.util.ImageIOHelper.getImageByteBuffer(ImageIOHelper.java:301)
    at net.sourceforge.tess4j.Tesseract.setImage(Tesseract.java:394)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:287)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:260)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:241)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:225)
    at java.lang.Thread.run(Thread.java:748)
ObjectCache(0x12d2a63f8)::~ObjectCache(): WARNING! LEAK! object 0x7fa8cf10e780 still has count 10 (id /usr/local/share/tessdata/eng.traineddatapunc-dawg)
ObjectCache(0x12d2a63f8)::~ObjectCache(): WARNING! LEAK! object 0x7fa8cf10e860 still has count 10 (id /usr/local/share/tessdata/eng.traineddataword-dawg)
ObjectCache(0x12d2a63f8)::~ObjectCache(): WARNING! LEAK! object 0x7fa8ccb7dda0 still has count 10 (id /usr/local/share/tessdata/eng.traineddatanumber-dawg)
ObjectCache(0x12d2a63f8)::~ObjectCache(): WARNING! LEAK! object 0x7fa8ccb7dca0 still has count 10 (id /usr/local/share/tessdata/eng.traineddatabigram-dawg)
ObjectCache(0x12d2a63f8)::~ObjectCache(): WARNING! LEAK! object 0x7fa8cee645a0 still has count 10 (id /usr/local/share/tessdata/eng.traineddatafreq-dawg)
Process finished with exit code 130 (interrupted by signal 2: SIGINT)
{code}
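The code that prints the "Used memory" figures is not shown in this comment. A minimal sketch of how such a figure is commonly obtained, assuming the standard java.lang.Runtime API; the class and method names (HeapUsage, usedBytes, toMegabytes) and the 64 MB allocation are hypothetical, for illustration only:

```java
public class HeapUsage {
    /** Heap currently occupied in this JVM, in bytes (includes garbage not yet collected). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes), rounding down. */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Before....Used memory: " + toMegabytes(usedBytes()) + " MB");

        // Stand-in for the real work (rendering plus OCR in the test above).
        byte[] block = new byte[64 * 1024 * 1024];

        System.out.println("After....Used memory: " + toMegabytes(usedBytes()) + " MB");
        // Reference the array afterwards so it stays reachable during the measurement.
        System.out.println("Allocated " + block.length + " bytes");
    }
}
```

A figure computed this way counts unreachable objects that the garbage collector has not reclaimed yet, so it can overstate live memory until a collection runs.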
I still don't understand how running OCR on 15 files of 70 KB each can consume
more than 3 GB of memory...
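For scale, the size of the PDF on disk says little about the size of a decoded page image, which depends on page dimensions, resolution and bytes per pixel. A rough arithmetic sketch follows. The US Letter page size (8.5 x 11 in) and the 3 bytes per pixel are assumptions, since the actual page size and page count of the attachment are not stated here. The class and method names are hypothetical:

```java
public class RasterSizeEstimate {
    /** Bytes needed for one uncompressed raster of the given page size, resolution and pixel depth. */
    static long rasterBytes(double widthInches, double heightInches, int dpi, int bytesPerPixel) {
        long widthPx = (long) Math.ceil(widthInches * dpi);
        long heightPx = (long) Math.ceil(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Assumption: US Letter page at the 300 DPI used in the issue, 3 bytes per RGB pixel.
        long perPage = rasterBytes(8.5, 11.0, 300, 3);
        System.out.println("One page raster: " + perPage + " bytes");

        // 15 threads, each holding a single page raster at the same time.
        System.out.println("15 concurrent rasters: " + (15 * perPage) + " bytes");
    }
}
```

Under these assumptions one page raster is about 25 MB, so a single copy per thread does not by itself account for 3 GB. The estimate grows with the number of pages and with every additional copy of the image that is alive at the same time (for example the rendered image, the encoded TIFF bytes and the image decoded again for OCR).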
> Memory issues when rendering pdf to image
> -----------------------------------------
>
> Key: PDFBOX-4739
> URL: https://issues.apache.org/jira/browse/PDFBOX-4739
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.18
> Reporter: Lior Yaffe
> Priority: Blocker
> Attachments: linkedinceoresume.pdf
>
>
> I'm trying to write a web service that performs OCR on input PDF files.
> The approach is simple: convert the PDF to TIFF images using PDFBox, then
> run Tesseract on the TIFF images to extract the text.
> The code is straightforward:
>
> {code:java}
> private List<ByteArrayOutputStream> convertPdfToTiff2() throws IOException {
>     List<ByteArrayOutputStream> fileList = new ArrayList<>();
>     PDDocument doc = PDDocument.load(this.bytes);
>     doc.setResourceCache(new EmptyCache());
>     try {
>         PDFRenderer pdfRenderer = new PDFRenderer(doc);
>         for (int page = 0; page < doc.getNumberOfPages(); ++page) {
>             BufferedImage bufferedImage =
>                 pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
>             calcImageSize(bufferedImage);
>             ByteArrayOutputStream os = new ByteArrayOutputStream();
>             ImageIO.write(bufferedImage, "tiff", os);
>             os.flush();
>             os.close();
>             bufferedImage.flush();
>             bufferedImage = null;
>             fileList.add(os);
>         }
>     } finally {
>         doc.close();
>     }
>     return fileList;
> }
> {code}
>
> I'm running a sample test that executes this concurrently on 5-6 threads,
> but the app crashes very quickly.
>
> I did some memory tests, and it seems that while the input file is around
> 70 KB, the {{pdfRenderer}} object is around 300 MB! Changing the DPI level
> makes no difference; the object is still very large.
> In addition, the memory only drops when I call the GC explicitly, even
> though I close the doc object.
>
> Basically, when I run my server with -Xmx6G and 6 concurrent threads, the
> service crashes after 3 runs. What am I missing here?
>
> * I attached the input PDF file.
>