It's a language thing: https://en.wikipedia.org/wiki/Typographic_ligature
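To make the ligature point concrete, here is a minimal Python sketch showing that 'ﬁ' and 'ﬂ' can exist as single Unicode code points, and how to split them back into separate letters. It assumes the training text itself contains these precomposed characters; if the ligature is instead produced by the font during rendering, cleaning the text this way will not change the boxes. The name `expand_ligatures` is made up for illustration and is not part of tesseract.

```python
import unicodedata

# Latin ligature code points in the Unicode "Alphabetic Presentation Forms"
# block: U+FB00 (ff) through U+FB06 (st).
LIGATURE_FIRST = 0xFB00
LIGATURE_LAST = 0xFB06


def expand_ligatures(text: str) -> str:
    """Replace precomposed Latin ligatures with their component letters.

    Hypothetical helper: only characters in U+FB00..U+FB06 are touched,
    so accented letters and other text are left as they are.
    """
    out = []
    for ch in text:
        if LIGATURE_FIRST <= ord(ch) <= LIGATURE_LAST:
            # The compatibility decomposition splits the ligature into letters.
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    sample = "\ufb01nd the \ufb02ow"  # starts with the single-character 'fi' ligature
    print(len("\ufb01"))             # the ligature is one code point
    print(expand_ligatures(sample))
```

Running the training text through something like this before rendering would at least rule out ligature characters in the input as the cause.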
Try specifying a specific language? This parameter looks like a possible match, since its description mentions glyphs:

segment_penalty_dict_nonword 1.25 Score multiplier for glyph fragment segmentations which do not match a dictionary word (lower is better).

Let me know what you find. I ran into this recently but have been chasing other issues and haven't verified a solution.

On Saturday, September 3, 2016 at 5:23:55 AM UTC-4, Brais Gabín Moreira wrote:
> Hi, I'm trying to train tesseract. But text2image creates a single box for
> 'fi' or 'fl'. Why it thinks that 'fi' or 'fl' are a single character
> instead of two? How can I fix this?