Thanks for the tip. I'll look into this.

On Thursday, April 3, 2025 at 12:12:52 PM UTC-4 Ajg wrote:

> I have an OCR program that tries to read and interpret many documents of 
> different composition. Some documents are PDFs whose first page is an 
> image, with the text on the second (or later) pages. When processing, it 
> can take several minutes or more just to get past the first page of the 
> PDF: the GetText() call is very slow when the page is an image with 
> little or no text on it. The application is .NET-based, using WinForms. 
> PDF pages with lots of text work fine.
>
> The relevant code in C# is:
> // Convert the rendered page bitmap to a Pix before running OCR.
> Pix img = PixConverter.ToPix(save_bitmap);
> var ocr = new TesseractEngine(..."tessdata5.2",
>                                            "eng",
>                                            EngineMode.LstmOnly);
> using var page = ocr.Process(img, PageSegMode.AutoOsd);
> ocrtext = page.GetText();   /* long time here */
>
> I do need to collect text from subsequent pages for indexing documents. 
>
> Thanks in advance for any comments you may have.  
>

