[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950218#comment-13950218
 ] 

Anurag Indu commented on TIKA-93:
---------------------------------

Hello all, I tried to extract all the images from a PDF and use Tesseract to 
convert them to text. I am using a Windows 8 laptop with an i5 and 8 GB of RAM, 
and it takes 15 minutes to process a single PDF. Could someone point me to the 
issue with the code (added below)? Where can I improve the performance? I am 
not using threading here.

    List<?> pages = document.getDocumentCatalog().getAllPages();
    Iterator<?> iter = pages.iterator();
    StringBuilder text = new StringBuilder();
    while (iter.hasNext()) {
        PDPage page = (PDPage) iter.next();
        PDResources resources = page.getResources();
        Map<String, PDXObjectImage> pageImages = resources.getImages();
        if (pageImages != null) {
            Iterator<String> imageIter = pageImages.keySet().iterator();
            while (imageIter.hasNext()) {
                String key = imageIter.next();
                PDXObjectImage image = pageImages.get(key);
                // PDFBox appends the image's own suffix to the file name;
                // the rest of this loop assumes that suffix is ".tiff".
                image.write2file(key);
                // Launch one Tesseract process per image and block until it exits.
                Runtime rt = Runtime.getRuntime();
                String command = "\"" + tessPath + "\" \"" + key + ".tiff\" out";
                Process pr = rt.exec(command);
                try {
                    result = pr.waitFor();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                // Exit code 0: Tesseract wrote its text to out.txt.
                if (result == 0) {
                    String x = readFile("out.txt", Charset.defaultCharset());
                    text.append(x);
                }

                // Clean up the temporary image and OCR output.
                new File(key + ".tiff").delete();
                new File("out.txt").delete();
            }
        }
    }
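
One way to narrow this down might be to isolate the external call from the
PDFBox code. The following is only a minimal sketch, not a tested fix: the
class OcrCommandRunner and its methods run and tesseractCommand are
hypothetical names, and it assumes Java 8 or later. It uses ProcessBuilder
instead of Runtime.exec, sends the child's output to a temporary log file so
the child cannot stall on a full pipe, and gives up after a timeout instead of
waiting indefinitely.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; not part of Tika or PDFBox.
class OcrCommandRunner {

    /**
     * Builds the argument list "tesseract <image> <outputBase>".
     * Passing a list avoids the manual quoting needed with a single string.
     */
    static List<String> tesseractCommand(String tessPath, String imageFile,
                                         String outputBase) {
        return Arrays.asList(tessPath, imageFile, outputBase);
    }

    /**
     * Runs an external command and returns its exit code,
     * or -1 if it timed out or the calling thread was interrupted.
     */
    static int run(List<String> command, long timeoutSeconds) {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Merge stderr into stdout so a single redirect captures both.
        pb.redirectErrorStream(true);
        try {
            // Send the child's output to a temp file so it never fills a pipe.
            File log = File.createTempFile("ocr-cmd", ".log");
            log.deleteOnExit();
            pb.redirectOutput(log);

            Process process = pb.start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                // The command took too long; kill it and report failure.
                process.destroyForcibly();
                return -1;
            }
            return process.exitValue();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            // Restore the interrupt flag for the caller.
            Thread.currentThread().interrupt();
            return -1;
        }
    }
}
```

In the loop above this could be called as
run(tesseractCommand(tessPath, key + ".tiff", "out-" + key), 120), where the
120-second limit is an arbitrary choice. A distinct output base per image
would also avoid sharing a single out.txt if the work is later spread across
threads. Wrapping each call in System.nanoTime() measurements should show
whether the time goes into Tesseract itself or into writing the images.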

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
> testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
