[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950218#comment-13950218 ]
Anurag Indu commented on TIKA-93:
---------------------------------
Hello all, I tried to use Tesseract to extract all the images from a PDF and
convert them to text. I am using a Windows 8 laptop with an i5 and 8 GB of
RAM, and it takes 15 minutes to process a single PDF. Could someone point out
what is wrong with the code below? Where can I improve the performance? I am
not using threading here.
    List<?> pages = document.getDocumentCatalog().getAllPages();
    Iterator<?> iter = pages.iterator();
    StringBuilder text = new StringBuilder();
    while (iter.hasNext()) {
        PDPage page = (PDPage) iter.next();
        PDResources resources = page.getResources();
        Map<String, PDXObjectImage> pageImages = resources.getImages();
        if (pageImages != null) {
            Iterator<String> imageIter = pageImages.keySet().iterator();
            while (imageIter.hasNext()) {
                String key = imageIter.next();
                PDXObjectImage image = pageImages.get(key);
                // Write the image to disk; the command below expects key + ".tiff".
                image.write2file(key);
                // Start one tesseract process per image and wait for it to finish.
                Runtime rt = Runtime.getRuntime();
                String command = "\"" + tessPath + "\" \"" + key + ".tiff\" out";
                Process pr = rt.exec(command);
                int result = -1; // stays non-zero if the wait is interrupted
                try {
                    result = pr.waitFor();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                // tesseract writes its text to <output base>.txt, here out.txt.
                if (result == 0) {
                    String x = readFile("out.txt", Charset.defaultCharset());
                    text.append(x);
                }
                // Remove the temporary image and OCR output.
                new File(key + ".tiff").delete();
                new File("out.txt").delete();
            }
        }
    }
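
Since the loop above runs one Tesseract process at a time and always writes to
the same out.txt, one thing that might be worth trying is running several OCR
processes in parallel, each with its own output file. The sketch below is only
an illustration under that assumption; whether it helps depends on where the
15 minutes actually go, which has not been measured here. The class
ParallelOcr and the methods ocrAll and runTesseract are hypothetical names,
not part of Tika or PDFBox, and the code assumes Java 8 and the usual
"tesseract <image> <output base>" command form that writes <output base>.txt.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper class, not part of Tika or PDFBox.
final class ParallelOcr {

    /** Applies ocr to every image path on a fixed thread pool; results keep input order. */
    static List<String> ocrAll(List<String> imagePaths,
                               Function<String, String> ocr,
                               int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String path : imagePaths) {
                Callable<String> task = () -> ocr.apply(path);
                futures.add(pool.submit(task));
            }
            List<String> texts = new ArrayList<>();
            for (Future<String> f : futures) {
                texts.add(f.get()); // blocks until this image is done
            }
            return texts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Runs the tesseract executable on one image and returns the recognized text. */
    static String runTesseract(String tessPath, String imagePath) {
        try {
            // Unique output base per call, so concurrent runs do not overwrite each other.
            Path outBase = Files.createTempFile("ocr", "");
            Path outTxt = Paths.get(outBase.toString() + ".txt");
            try {
                Process p = new ProcessBuilder(tessPath, imagePath, outBase.toString())
                        .redirectErrorStream(true)
                        // Pass process output through so a full pipe cannot stall it.
                        .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                        .start();
                if (p.waitFor() != 0) {
                    return "";
                }
                // Assumes tesseract writes UTF-8 text.
                return new String(Files.readAllBytes(outTxt), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(outTxt);
                Files.deleteIfExists(outBase);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

With the image files already written to disk, a call could look like
ParallelOcr.ocrAll(paths, p -> ParallelOcr.runTesseract(tessPath, p),
Runtime.getRuntime().availableProcessors()), where paths is the list of image
file names and tessPath is the same variable as in the snippet above.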
> OCR support
> -----------
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.6
>
> Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch,
> testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are
> command line OCR tools like Tesseract
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.
--
This message was sent by Atlassian JIRA
(v6.2#6252)