On Thu, Mar 22, 2012 at 12:59 PM, Demian Katz <demian.k...@villanova.edu> wrote:
> Hello,
>
> I'm using Tesseract 3 as a simple command-line tool to generate OCR.
> It's doing a fairly good job, but I have one unmet need -- I need to
> be able to separate paragraphs with blank lines. It would be great if
> Tesseract could do this for me, but I'd even be happy if it could
> include indentation whitespace in the text so I could perform the
> splitting using my own software.
>
> Is there any way to achieve this effect?

One choice is to dump out hOCR instead of just UTF-8 text. So do:

    tesseract test.tif test hocr

where hocr is the name of the built-in config file that is in tessdata/configs. This will generate test.html instead of test.txt. See [1] for a bit more info on hOCR.

If you aren't afraid of doing some programming, look at the code for TessBaseAPI::GetHOCRText [2]. It uses res_it->IsAtBeginningOf(RIL_PARA) to figure out where each paragraph begins.

> On a somewhat related note,
> is there any way to control Tesseract's command line behavior at all?
> I see that it accepts a config file as a command-line option, but I'm
> having no luck finding documentation on what options are available or
> what they mean -- the provided examples don't actually seem to work,
> and even searching the code hasn't given me anything resembling a list
> of valid options.
>
> Any help or pointers in the right direction would be greatly
> appreciated!
>
> thanks,
> Demian

AFAIK there aren't any good docs on config files yet (I'm working on that). But look in tessdata/configs & tessdata/tessconfigs for example config files. To get a list of possible config file parameters, see this thread [3], in particular this message by me [4].

[1] http://en.wikipedia.org/wiki/HOCR
[2] http://code.google.com/p/tesseract-ocr/source/browse/trunk/api/baseapi.cpp#932
[3] http://groups.google.com/group/tesseract-ocr/browse_thread/thread/2eda8cda1d5557c1/
[4] http://groups.google.com/group/tesseract-ocr/msg/73565d039201f2e6
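
For the hOCR route, here is a rough sketch of turning the generated HTML back into plain text with a blank line between paragraphs. It assumes each paragraph in the hOCR output is an element whose class includes ocr_par (the paragraph class defined by hOCR); check your own test.html to confirm that before relying on it. The names HocrParagraphExtractor and hocr_to_text are made up for this example and are not part of Tesseract.

```python
from html.parser import HTMLParser


class HocrParagraphExtractor(HTMLParser):
    """Collect the text of each element whose class includes 'ocr_par'.

    Assumption: the hOCR file marks paragraphs with class 'ocr_par'.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraphs, one string each
        self._chunks = None    # text pieces of the open paragraph, or None
        self._par_tag = None   # tag name of the open paragraph element
        self._depth = 0        # nesting depth of that tag name

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._chunks is None:
            if "ocr_par" in classes:
                # A new paragraph starts here.
                self._chunks = []
                self._par_tag = tag
                self._depth = 1
        elif tag == self._par_tag:
            # Same tag nested inside the paragraph; track it so the
            # matching end tag does not close the paragraph early.
            self._depth += 1

    def handle_endtag(self, tag):
        if self._chunks is not None and tag == self._par_tag:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and collapse runs of whitespace.
                text = " ".join(" ".join(self._chunks).split())
                if text:
                    self.paragraphs.append(text)
                self._chunks = None

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)


def hocr_to_text(hocr_html):
    """Return the paragraphs of an hOCR document separated by blank lines."""
    parser = HocrParagraphExtractor()
    parser.feed(hocr_html)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

To use it on the file produced by the command above, read test.html as a string and pass it to hocr_to_text, then write the result wherever you need it. Line breaks inside a paragraph are collapsed into spaces, which may or may not be what you want.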