Re: Ugly behavior when recognizing – advice requirement

Dmitri Silaev Tue, 07 May 2013 01:50:20 -0700

Andres,

Your code seems to be correct. I personally use a few more lines right
after the call to GetIterator():
    it->Begin();
    if(it->IsAtFinalElement(RIL_BLOCK, RIL_SYMBOL))
        return;
    if(!it->IsAtBeginningOf(RIL_SYMBOL))
        return;
But this shouldn't bother you if you rely on non-degenerate cases.
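
The pattern-matching idea mentioned earlier in this thread (collect several variants per symbol together with their confidences, then keep only the ones that fit an allowed pattern) could be sketched roughly as follows. This is only an illustration under stated assumptions, not CustomOCR's code: the `Candidate` struct, the `A`/`9` pattern alphabet, the function names and the sample data are all hypothetical. In practice the candidates would be filled in from ChoiceIterator, as in the snippet quoted below.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One candidate reading for a symbol: the character and its confidence
// (higher is better). Hypothetical type, not part of the Tesseract API.
struct Candidate {
    char ch;
    float confidence;
};

// Hypothetical pattern alphabet: 'A' = any uppercase letter, '9' = any digit.
static bool MatchesClass(char pattern_class, char ch) {
    const unsigned char c = static_cast<unsigned char>(ch);
    if (pattern_class == 'A') return std::isupper(c) != 0;
    if (pattern_class == '9') return std::isdigit(c) != 0;
    return false;
}

// For each position, pick the most confident candidate that the pattern allows.
// Returns an empty string if the lengths differ or if some position has no
// admissible candidate.
std::string BestMatch(const std::vector<std::vector<Candidate> >& symbols,
                      const std::string& pattern) {
    if (symbols.size() != pattern.size()) return "";
    std::string result;
    for (std::size_t i = 0; i < symbols.size(); ++i) {
        const Candidate* best = nullptr;
        for (const Candidate& c : symbols[i]) {
            if (MatchesClass(pattern[i], c.ch) &&
                (best == nullptr || c.confidence > best->confidence)) {
                best = &c;
            }
        }
        if (best == nullptr) return "";
        result += best->ch;
    }
    return result;
}

// Made-up candidates for illustration only; not real Tesseract output.
// The last symbol has 'B' as the top choice and '8' as an alternative.
std::vector<std::vector<Candidate> > ExampleSymbols() {
    return {
        {Candidate{'S', 95.f}},
        {Candidate{'A', 91.f}},
        {Candidate{'A', 90.f}},
        {Candidate{'5', 93.f}},
        {Candidate{'2', 96.f}},
        {Candidate{'9', 94.f}},
        {Candidate{'B', 80.f}, Candidate{'8', 75.f}},
    };
}
```

Calling `BestMatch(ExampleSymbols(), "AAA9999")` rejects the `B` in the last position because the pattern requires a digit there, and falls back to the `8`. Note that this only helps when the correct character is among the alternatives; it cannot repair a segmentation error such as two characters merged into one blob.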


Well, I suggest using revision 724. I have battle-tested it, and it
probably contains fewer bugs and strikes a better balance between accuracy
and speed than any newer revision. Although newer ones may
introduce many fancy features, I'd refrain from using them in
production. Maybe this can help you.

Warm regards,
Dmitri Silaev
www.CustomOCR.com


On Mon, May 6, 2013 at 9:28 AM, Andres <andrej...@gmail.com> wrote:
> To answer part of my last question: I've found a way of getting the
> alternatives for each character, but it does not seem to work in 3.01,
> according to both my own tests and
> http://code.google.com/p/tesseract-ocr/issues/detail?id=714
> My snippet:
>
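> #include <api/resultiterator.h>
>
> ...
>
> tess_api.SetVariable("save_blob_choices", "T");
>
> ...
>
> tesseract::ResultIterator* it = tess_api.GetIterator();
>
> if (it != NULL)
> {
>     do
>     {
>         // GetUTF8Text() returns a new[]-allocated string; the caller must free it.
>         char* uval = it->GetUTF8Text(tesseract::RIL_SYMBOL);
>         cout<<(uval == NULL ? "#" : uval)<<"("<<it->Confidence(tesseract::RIL_SYMBOL)<<"){";
>         delete[] uval;
>         // Print every alternative for this symbol with its confidence.
>         // ChoiceIterator::GetUTF8Text() points to internal data: do not free it.
>         tesseract::ChoiceIterator ci(*it);
>         do
>         {
>             const char* val = ci.GetUTF8Text();
>             cout<<" "<<(val == NULL ? "#" : val)<<" "<<ci.Confidence();
>         }
>         while (ci.Next());
>         cout<<"}";
>     }
>     while (it->Next(tesseract::RIL_SYMBOL));
>     delete it;  // the iterator returned by GetIterator() is owned by the caller
> }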
>
> On Monday, May 6, 2013 01:50:42 UTC-3, Andres wrote:
>>
>> Hi Dmitri,
>>
>> Many thanks for your hints, as always.
>>
>> Sorry about the links in my previous message. I'll repost the entire
>> message below, with the links fixed.
>>
>> I like the method that you say you use in CustomOCR. Is there a way
>> of getting the character variants without resorting to a hack? As far as I
>> can see, the API only exposes the confidence level for each character.
>> Am I right about this?
>>
>> Regarding psm mode, I'm setting it from inside my code to the value 7,
>> which means "Treat the image as a single text line". Is that the parameter
>> you are suggesting?
>>
>> Anyway, I think I may have made big newbie errors in my training, so I
>> would be grateful if you could look at my training image and my problematic
>> images and tell me whether you see an obvious error at first sight.
>>
>> My training image:
>>
>> https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing
>>
>> Problematic image (a "6" recognized as a "5"):
>>
>> https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>>
>> Another problematic image ("A A" recognized as "M")
>> https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit
>>
>> The following is my original message with the links fixed:
>>
>> Dear people,
>>
>> I trained Tesseract for my font (FE-Schrift:
>> http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad results.
>> I am using Tesseract 3.01 under Windows.
>>
>> In this image:
>>
>>
>> https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing
>>
>> The text is SAA5298, but I’m getting SM529B. This is being done from inside a
>> program, and I know that the “M” in the result comes from the “AA” in
>> the source. So Tesseract is segmenting these two characters very badly,
>> even though they are clearly separated, as you can see. Do you have any
>> idea why this is happening? Also, is there a way to give Tesseract a hint
>> for this (e.g., telling it the character width)?
>>
>> The other problem is with this one:
>>
>>
>> https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>>
>> Where text is LDA6244, Tesseract is recognizing a “5” instead of a “6”,
>> even when the image is very good.
>>
>>  Here is my fonts training file:
>>
>>
>> https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing
>>
>> Here is my box file:
>>
>>
>> https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing
>>
>> Here is my .traineddata file:
>>
>>
>> https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing
>>
>> And here is a .cmd file for testing these 2 images:
>>
>>
>> https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing
>>
>>
>>
>> Thanks,
>>
>> Andres
>>
>> On Friday, May 3, 2013 16:05:50 UTC-3, Dmitri Silaev wrote:
>>>
>>> Andres,
>>>
>>> First of all, your first link seems to point to a "traineddata" file
>>> instead of an image. Second, without diving deep into your
>>> problem, I can suggest specifying the single-line psm mode on the
>>> command line. Finally, you can use the user patterns feature to
>>> restrict Tesseract's possible output (for the format, see the comments in
>>> dict/trie.h on read_pattern_list()). Another way of achieving the
>>> latter, which we use in CustomOCR and which seems to be more reliable, is
>>> to use the API to get a number of character variants for each blob
>>> along with their confidence levels and match them against a set of possible
>>> patterns. You can find out how to do this by searching this forum.
>>>
>>> HTH and good luck with Tesseract!
>>>
>>> Warm regards,
>>> Dmitri Silaev
>>> www.CustomOCR.com
>>>
>>>
>>> On Fri, May 3, 2013 at 8:24 PM, Andres <andr...@gmail.com> wrote:
>>> > Dear people,
>>> >
>>> > I trained Tesseract for my font (FE-Schrift:
>>> > http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad results.
>>> > I am using Tesseract 3.01 under Windows.
>>> >
>>> > In this image:
>>> >
>>> >
>>> > https://docs.google.com/file/d/0BxkuvS_LuBAzeFNZUVA1cThLMG8/edit?usp=sharing
>>> >
>>> > The text is SAA5298, but I’m getting SM529B. This is being done from
>>> > inside a program, and I know that the “M” in the result comes from the
>>> > “AA” in the source. So Tesseract is segmenting these two characters very
>>> > badly, even though they are clearly separated, as you can see. Do you
>>> > have any idea why this is happening? Also, is there a way to give
>>> > Tesseract a hint for this (e.g., telling it the character width)?
>>> >
>>> > The other problem is with this one:
>>> >
>>> >
>>> > https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>>> >
>>> > Where text is LDA6244, Tesseract is recognizing a “5” instead of a “6”,
>>> > even when the image is very good.
>>> >
>>> >
>>> >
>>> > Here is my fonts training file:
>>> >
>>> >
>>> > https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing
>>> >
>>> > Here is my box file:
>>> >
>>> >
>>> > https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing
>>> >
>>> > Here is my .traineddata file:
>>> >
>>> >
>>> > https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing
>>> >
>>> > And here is a .cmd file for testing these 2 images:
>>> >
>>> >
>>> > https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing
>>> >
>>> >
>>> >
>>> > Thanks,
>>> >
>>> > Andres
>>> >
>>> > --
>>> > --
>>> > You received this message because you are subscribed to the Google
>>> > Groups "tesseract-ocr" group.
>>> > To post to this group, send email to tesser...@googlegroups.com
>>> > To unsubscribe from this group, send email to
>>> > tesseract-oc...@googlegroups.com
>>> > For more options, visit this group at
>>> > http://groups.google.com/group/tesseract-ocr?hl=en
>>> >
>>> > ---
>>> > You received this message because you are subscribed to the Google
>>> > Groups
>>> > "tesseract-ocr" group.
>>> > To unsubscribe from this group and stop receiving emails from it, send
>>> > an
>>> > email to tesseract-oc...@googlegroups.com.
>>> > For more options, visit https://groups.google.com/groups/opt_out.
>>> >
>>> >
