[tesseract-ocr] Re: text2image creates char boxes for 'fi' and 'fl'. Why?

2016-09-08 Thread Brais Gabín Moreira
How can I set a blacklist for text2image? Passing
-c tessedit_char_blacklist=fifl doesn't work for me.

My problem is that text2image writes things like this:

fl 133 162 159 199 5

I also tried --ligatures=true, but got this result:

fl 133 162 159 199 5

I'll continue with my research...
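
As a possible workaround for the box lines above, here is a minimal sketch that post-processes a box file and splits each ligature entry into one box per letter. It assumes the usual box line layout (symbol, left, bottom, right, top, page) and that the offending symbol really is a single ligature code point such as U+FB02. The helper names and the equal-width split are illustrative assumptions, not something text2image provides, and I have not verified how well the resulting boxes work for training.

```python
# Hypothetical helper, not part of Tesseract: expand ligature entries in box-file lines.
# Assumed box line layout: "<symbol> <left> <bottom> <right> <top> <page>".

LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def split_ligature_line(line):
    """Return a list of box lines; a ligature box becomes one box per letter."""
    line = line.rstrip("\n")
    parts = line.split(" ")
    if len(parts) != 6 or parts[0] not in LIGATURES:
        return [line]  # not a ligature entry: keep the line unchanged
    letters = LIGATURES[parts[0]]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    width = right - left
    boxes = []
    for i, ch in enumerate(letters):
        # Heuristic: give every letter an equal share of the ligature's width.
        x0 = left + width * i // len(letters)
        x1 = left + width * (i + 1) // len(letters)
        boxes.append(f"{ch} {x0} {bottom} {x1} {top} {page}")
    return boxes


def split_ligatures(lines):
    """Apply split_ligature_line to every line of a box file."""
    result = []
    for line in lines:
        result.extend(split_ligature_line(line))
    return result


if __name__ == "__main__":
    for out in split_ligatures(["\ufb02 133 162 159 199 5", "a 160 162 175 190 5"]):
        print(out)
```

The equal-width split is only an approximation, since the 'f' and the 'l' of a rendered ligature are rarely the same width, so the boxes may need manual adjustment in a box editor.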

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/14f3f358-15a0-4498-a3ba-cfaede57e717%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: text2image creates char boxes for 'fi' and 'fl'. Why?

2016-09-04 Thread fuzzy7k
My earlier successes were definitely font-related. Use a blacklist or a
whitelist:

-c tessedit_char_blacklist=fifl

https://groups.google.com/d/topic/tesseract-ocr/jO_4ZMMK9xw/discussion


[tesseract-ocr] Re: text2image creates char boxes for 'fi' and 'fl'. Why?

2016-09-03 Thread fuzzy7k
It's a language thing: https://en.wikipedia.org/wiki/Typographic_ligature

Try specifying a specific language?

This parameter might be related (its description mentions "glyph"):

segment_penalty_dict_nonword    1.25    Score multiplier for glyph fragment
segmentations which do not match a dictionary word (lower is better).

Let me know what you find. I had this occur recently but have been chasing 
other issues and haven't verified a solution.
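
On the ligature point above, a small Python sketch shows why 'fi' or 'fl' can count as one character: Unicode defines dedicated ligature code points (U+FB01 and U+FB02), and a font or renderer may substitute them for the two-letter sequence. Whether text2image made that substitution in this particular case is an assumption on my part. The function name is only illustrative.

```python
import unicodedata


def describe_ligature(ch):
    """Return (Unicode name, NFKC expansion) for a single character."""
    return unicodedata.name(ch), unicodedata.normalize("NFKC", ch)


if __name__ == "__main__":
    for ch in ("\ufb01", "\ufb02"):
        name, expanded = describe_ligature(ch)
        # One code point in, two ordinary letters out after compatibility normalization.
        print(f"U+{ord(ch):04X} {name}: {len(ch)} code point -> "
              f"{expanded!r} ({len(expanded)} code points)")
```

Normalizing training text or box symbols with NFKC is one way to check whether a ligature code point is hiding behind what looks like two letters.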


On Saturday, September 3, 2016 at 5:23:55 AM UTC-4, Brais Gabín Moreira 
wrote:
>
> Hi, I'm trying to train Tesseract, but text2image creates a single box for
> 'fi' or 'fl'. Why does it treat 'fi' or 'fl' as a single character
> instead of two? How can I fix this?
>
