*Short version:* Ghostscript uses Tesseract, but the interface through 
which they exchange data may contain a bug. However, the Ghostscript 
developers are not convinced it's really a bug, so I'm trying to find more 
evidence here.

*Long version:* Ghostscript now has the ability to perform OCR on documents 
via Tesseract. It has a really nifty feature 
<https://ghostscript.com/docs/9.54.0/Devices#PDFwriteocr>: you don't have to 
flatten the document to a bitmap first, which is generally undesirable. 
Instead, Ghostscript takes vector text (its glyphs), renders a small portion 
of them to a bitmap and feeds it to Tesseract. Then it takes the resulting 
character codes and assigns them to the original vector glyphs, thus 
preserving the vector content 
of the document. I tried to use this feature to fix old PDF files that have 
completely garbled text encoding, i.e. their text looks fine on screen, but 
total garbage ("mojibake") is returned when I try to copy and paste from 
them. 

It works surprisingly well, but I noticed one oddity: sometimes Tesseract 
returns characters from very exotic languages, even though the document's 
language is specified. In my case, the document is Czech, but certain 
characters are consistently returned as Ol Chiki or Hangul (Korean 
alphabet). My original bug report contains concrete examples and a 
surprisingly detailed reply from one of the Ghostscript developers. It would 
be pointless to repeat it all, so please look here:

https://bugs.ghostscript.com/show_bug.cgi?id=708548
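
For anyone who wants to check their own output for the same symptom, here 
is a minimal sketch of how I would flag such characters in text that has 
already been extracted from the PDF (for example by copy and paste). It is 
only a rough heuristic: it guesses the script from the Unicode character 
name, and the function names are my own, not part of Ghostscript or 
Tesseract.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Rough script guess taken from the start of the Unicode character name."""
    try:
        name = unicodedata.name(ch)
    except ValueError:
        # Character has no name in the Unicode database.
        return "UNKNOWN"
    if name.startswith("OL CHIKI"):
        # The script name has two words, so handle it explicitly.
        return "OL CHIKI"
    return name.split()[0]


def unexpected_scripts(text, expected=("LATIN",)):
    """Count letters whose guessed script is not in `expected`."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            script = script_of(ch)
            if script not in expected:
                counts[script] += 1
    return dict(counts)
```

For a Czech document the result should normally be empty; any "HANGUL" or 
"OL CHIKI" entry points at the problem described above.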

Do you think he is right? I checked the Ghostscript source code, but couldn't 
work out which --psm setting it uses. I assume it's 7 (single line) or 8 
(single word). Can Tesseract return characters from totally different 
alphabets with these settings, even when the language is specified? I tried 
to google it of course, but found nothing conclusive. Thanks.
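
In case it helps anyone reproduce this outside Ghostscript, below is a small 
sketch that runs the standalone tesseract command-line program on one image 
with both page segmentation modes. The assumptions are mine: the image path 
is a hypothetical rendering of the affected glyphs, the Czech traineddata 
("ces") is installed, and tesseract is on the PATH. It does not reproduce 
what Ghostscript does internally, since Ghostscript links against Tesseract 
rather than calling the command-line tool.

```python
import subprocess


def tesseract_cmd(image_path, lang="ces", psm=7):
    """Build the tesseract command line that prints recognised text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr(image_path, lang="ces", psm=7):
    """Run tesseract on one image and return the recognised text."""
    result = subprocess.run(
        tesseract_cmd(image_path, lang=lang, psm=psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def compare_modes(image_path, lang="ces", modes=(7, 8)):
    """Return {psm: text} so the two segmentation modes can be compared."""
    return {psm: ocr(image_path, lang=lang, psm=psm) for psm in modes}
```

Calling compare_modes("glyphs.png") on a crop of the problematic text would 
show whether plain Tesseract also returns Hangul or Ol Chiki characters in 
either mode.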
