On Thu, Mar 22, 2012 at 12:59 PM, Demian Katz <demian.k...@villanova.edu> wrote:
> Hello,
>
> I'm using Tesseract 3 as a simple command-line tool to generate OCR.
> It's doing a fairly good job, but I have one unmet need -- I need to
> be able to separate paragraphs with blank lines.  It would be great if
> Tesseract could do this for me, but I'd even be happy if it could
> include indentation whitespace in the text so I could perform the
> splitting using my own software.
>
> Is there any way to achieve this effect?

One option is to output hOCR instead of plain UTF-8 text. Run:

   tesseract test.tif test hocr

where hocr is the name of the built-in config file that is in
tessdata/configs. This will generate test.html instead of test.txt.
See [1] for a bit more info on hOCR.

If you aren't afraid of doing some programming, look at the code for
TessBaseAPI::GetHOCRText [2]. It uses
res_it->IsAtBeginningOf(RIL_PARA) to figure out where each paragraph
begins.
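
If you'd rather post-process the hOCR file than touch the C++ API, something like the following might do. It is only a rough sketch using the Python standard library. It assumes each paragraph is emitted as a <p> element whose class includes 'ocr_par' and each text line as an element whose class includes 'ocr_line'; check that against your own test.html. The names ParagraphExtractor and hocr_to_text are made up for this example.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> whose class includes 'ocr_par'.

    Assumption: paragraphs are <p class='ocr_par'> elements and lines are
    elements with class 'ocr_line'. Verify this against your hOCR output.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraphs, in document order
        self._lines = None     # lines of the paragraph being read, else None
        self._words = []       # raw text fragments of the current line

    def _flush_line(self):
        # Join the fragments and collapse runs of whitespace to one space.
        text = " ".join("".join(self._words).split())
        if text and self._lines is not None:
            self._lines.append(text)
        self._words = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_par" in classes:
            self._lines = []          # a new paragraph starts here
        elif "ocr_line" in classes:
            self._flush_line()        # a new line ends the previous one

    def handle_endtag(self, tag):
        if tag == "p" and self._lines is not None:
            self._flush_line()
            if self._lines:
                self.paragraphs.append("\n".join(self._lines))
            self._lines = None

    def handle_data(self, data):
        # Text outside a paragraph (page title, stray whitespace) is ignored.
        if self._lines is not None:
            self._words.append(data)


def hocr_to_text(hocr):
    """Return plain text with one blank line between paragraphs."""
    parser = ParagraphExtractor()
    parser.feed(hocr)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

You would call it as hocr_to_text(open("test.html", encoding="utf-8").read()) and write the result wherever you need it.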

> On a somewhat related note,
> is there any way to control Tesseract's command line behavior at all?
> I see that it accepts a config file as a command-line option, but I'm
> having no luck finding documentation on what options are available or
> what they mean -- the provided examples don't actually seem to work,
> and even searching the code hasn't given me anything resembling a list
> of valid options.
>
> Any help or pointers in the right direction would be greatly
> appreciated!
>
> thanks,
> Demian

AFAIK there aren't any good docs on config files yet (I'm working on
that). But look in tessdata/configs & tessdata/tessconfigs for example
config files. To get a list of possible config file parameters, see
this thread [3], in particular this message by me [4].
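
For what it's worth, the example config files are plain text with one parameter per line, the name followed by its value. A small sketch of generating such a file from a script is below. render_config is a made-up helper name, and writing booleans as 1 or 0 is an assumption based on the shipped examples, so compare the output with the files in tessdata/configs before relying on it.

```python
def render_config(params):
    """Render a mapping of parameter names to values as config-file text.

    Assumption: one "name value" pair per line, booleans written as 1 / 0.
    """
    lines = []
    for name, value in params.items():
        if isinstance(value, bool):
            value = int(value)  # True -> 1, False -> 0
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

For instance, render_config({"tessedit_create_hocr": True}) produces a one-line file that should be equivalent to the built-in hocr config.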

[1] http://en.wikipedia.org/wiki/HOCR

[2] http://code.google.com/p/tesseract-ocr/source/browse/trunk/api/baseapi.cpp#932

[3] http://groups.google.com/group/tesseract-ocr/browse_thread/thread/2eda8cda1d5557c1/

[4] http://groups.google.com/group/tesseract-ocr/msg/73565d039201f2e6
