Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White
OK, so I whipped up a program that uses Pango to get character metrics information for a given font, of the sort that is useful for Tesseract's unicharset file. It takes a file with UTF-8 characters separated by newlines, and a font description (in the same format as you provide to text2image;

[tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-10 Thread Paul
Maybe the numbers you are complaining about come from the use of "old-style numerals", which fonts such as Georgia have (see old-style-numerals.png). But this is only a guess. On Friday, 4 July 2014 at 06:40:51 UTC+2, Albrecht Hilker wrote: > > Hello > > Generally it is very sad that there

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-10 Thread pete ballsack
Nick, In searching I found out what was causing that crash. When I combined my files to make that particular traineddata file, I omitted the shapetable. I recombined them with the shapetable and it doesn't crash on the default psm. In regards to the uzn files, I double checked and there aren't any

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-10 Thread Alex Ryan
Nick, In searching I found out what was causing that crash. When I combined my files to make that particular traineddata file, I omitted the shapetable. I recombined them with the shapetable and it doesn't crash on the default psm. In regards to the uzn files, I double checked and there aren't a

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-10 Thread Nick White
On Tue, Jul 08, 2014 at 10:36:50PM -0700, Alex Ryan wrote: > In one of the links, though, I saw something about the -psm setting. When I run the OCR > with -psm 6 all of a sudden it worked perfectly!!! I'm really not sure what that > setting does. I've tried doing some searches, but I'm still unclear. Can you

Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White
I have more thoughts on the unicharset metrics discussion. > So this example says that > the character "1" has a min_bottom value of 59 and > the character "9" has a min_bottom value of 18. > > Weird ? ? ? > Both numbers are aligned to the baseline! I am guessing now (I'll take a look at the cod

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White
On Sat, Jul 05, 2014 at 03:34:05PM -0700, Albrecht Hilker wrote: > Hello zdenop > > It is clear that you are not the right person to answer this question. > If YOU had ever looked into the source code, you would have seen that these > values ARE in use (in version 3.03). You're being pretty unfai

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White
I'm just going to go through your numbered points here. On Fri, Jul 04, 2014 at 10:02:43AM -0700, Albrecht Hilker wrote: > 1.) > The column "other_case" should contain the ID of the other-case letter. > For the lowercase letters they point correctly to the uppercase letters. > But the uppercase le

Re: [tesseract-ocr] Any way to prevent contextual digits<->letters flipping ?

2014-07-10 Thread Nick White
Hi, I haven't tried it, but quickly grepping around the source code suggests setting the config variable "crunch_include_numerals" to true might do the job. Please let us know if that works. Nick On Wed, Jul 09, 2014 at 11:15:10PM -0700, Damien D wrote: > Hi everyone, > > tesseract seems to

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-10 Thread Nick White
Hi Alex, One quick thought, if you're still using .uzn, it's only loaded with certain psm levels (it is with -psm 6, but not -psm 3, the default). And it's loaded from .uzn. So if you have any .uzn files lying around, they will be being applied with psm 6, but not if you don't explicitly stat

[tesseract-ocr] Any way to prevent contextual digits<->letters flipping ?

2014-07-10 Thread Damien D
Hi everyone, Tesseract seems to sometimes use the closest characters to "guess" what the next one will be. Let me explain that with an example: I want to parse a picture that contains the following sequence of characters SE3P-104168 but most of the time the output will be SESP-104168 I believe

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-10 Thread Alex Ryan
Paul, I haven't gotten a chance to play around with that yet, but thanks for linking that; I might very well have to go that route. I am having a very confusing issue, though, that I'm hoping maybe someone can shed some light on. I've been testing out my language traineddata on a bunch of different b

[tesseract-ocr] Failed to get the text

2014-07-10 Thread Fajar Faqih
I'm trying to convert image text to text in Android using tesseract.API. Here's the image and the result