Thanks for the tip. I'll look into this.

On Thursday, April 3, 2025 at 12:12:52 PM UTC-4 Ajg wrote:
> I have an OCR program that tries to read and interpret many documents of
> different composition. Some documents are PDFs that have an image as the
> first page with text on the second (or later) pages. When processing, it
> can take several minutes or more just to get past the first page of the
> PDF on the GetText() call when that page is an image with little or no
> text on it. The application is .NET-based, using WinForms. PDF pages with
> lots of text work fine.
>
> The relevant code in C# is:
>
>     var ocr = new TesseractEngine(..."tessdata5.2",
>                                   "eng",
>                                   EngineMode.LstmOnly);
>     using var page = ocr.Process(img, PageSegMode.AutoOsd);
>     ocrtext = page.GetText(); /* long time here */
>
>     Pix img = PixConverter.ToPix(save_bitmap);
>
> I do need to collect text from subsequent pages for indexing documents.
>
> Thanks in advance for any comments you may have.

