[tesseract-ocr] "Tesseract --psm Impact on Scanned Books with Two Facing Pages"

Mahmoud Mohamed Thu, 20 Feb 2025 20:38:27 -0800


*Understanding Tesseract OCR and --psm: Why Removing It Can Improve Accuracy for Scanned Books*

Introduction

Tesseract OCR is a powerful tool for extracting text from images, but
selecting the right parameters is critical for accuracy. One commonly
misunderstood parameter is --psm (Page Segmentation Mode). In this guide,
we'll discuss a real-world issue where an explicit --psm setting caused
incorrect text extraction for scanned books and how removing it led to
*better results*.

The Problem: Mixed Text from Two Facing Pages

Many scanned books and documents contain *two facing pages in a single
image*. When such an image is processed with an explicit --psm value,
Tesseract can *misinterpret the structure* and jumble the extracted text.
Instead of reading one page at a time, it mixes text from both pages:

    Line 1 from right page + Line 1 from left page
    Line 2 from right page + Line 2 from left page
    ...

This happens because a fixed --psm value tells Tesseract to assume a
specific layout (for example, --psm 6 assumes a single uniform block of
text), which can *conflict with the actual structure of the scanned
document*.
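
The interleaving can be pictured with a toy model. The sketch below is not Tesseract's segmentation algorithm; it only assumes that each text line is tagged with its page and row, and that the right page is read first as in the case described here. The names `single_block_order` and `page_by_page_order` are made up for illustration.

```python
# Toy model of reading order on a two-page spread (not Tesseract's algorithm).
# Each entry is (page, row, text); the right page is read first, as in the post.
lines = [
    ("right", 1, "R1"), ("left", 1, "L1"),
    ("right", 2, "R2"), ("left", 2, "L2"),
]

PAGE_ORDER = {"right": 0, "left": 1}

def single_block_order(lines):
    """Treat the spread as one block: walk row by row across both pages."""
    ordered = sorted(lines, key=lambda l: (l[1], PAGE_ORDER[l[0]]))
    return [text for _, _, text in ordered]

def page_by_page_order(lines):
    """Split into pages first, then read each page top to bottom."""
    ordered = sorted(lines, key=lambda l: (PAGE_ORDER[l[0]], l[1]))
    return [text for _, _, text in ordered]

print(single_block_order(lines))   # ['R1', 'L1', 'R2', 'L2']
print(page_by_page_order(lines))   # ['R1', 'R2', 'L1', 'L2']
```

The first ordering matches the jumbled output above; the second is what a page-aware layout analysis would be expected to produce.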

The Solution: Removing --psm

With --psm removed, Tesseract falls back to its default mode (--psm 3,
fully automatic page segmentation without orientation and script
detection). In this case it *processed the right page first in order*,
then moved to the left page. This resulted in a natural reading order and a
significantly better OCR result:

    Line 1 from right page
    Line 2 from right page
    ...
    (Line 1 from left page follows after the right page is complete)

This suggests that *in some cases, manually setting --psm can do more harm
than good*.
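
To check whether a given scan is affected, the same image can be run both ways and the outputs compared. This is a minimal sketch: it assumes pytesseract and Pillow are installed, the helper names are hypothetical, and `forced_psm=6` is only a placeholder because the post does not say which value was originally used.

```python
def build_config(psm=None, oem=1):
    """Return a Tesseract config string; --psm is omitted when psm is None."""
    parts = [f"--oem {oem}"]
    if psm is not None:
        parts.append(f"--psm {psm}")
    return " ".join(parts)

def ocr_both_ways(image_path, forced_psm=6, lang="eng"):
    """OCR one image with a forced --psm and again with the default mode."""
    # Imported here so build_config can be used without the OCR stack installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    forced = pytesseract.image_to_string(
        image, lang=lang, config=build_config(forced_psm))
    default = pytesseract.image_to_string(
        image, lang=lang, config=build_config())
    return forced, default
```

Reading the first few lines of each result side by side is usually enough to see whether the pages were interleaved.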

When to Avoid --psm

- When processing scanned *books or documents with two facing pages*.
- When *text is misaligned or mixed* in the OCR output.
- When dealing with *complex layouts* where Tesseract's automatic
  layout analysis works better.

When to Use --psm

There are cases where --psm is still useful, such as:

- A single uniform block of printed text (--psm 6)
- Sparse text (--psm 11)
- Images containing only a single word (--psm 8)
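
For quick reference, the modes mentioned in this post can be kept in a small lookup table. The descriptions paraphrase the output of `tesseract --help-psm`; the table is deliberately shortened and the helper name `describe_psm` is hypothetical.

```python
# Shortened table: only the modes discussed in this post.
PSM_MODES = {
    3: "Fully automatic page segmentation, but no OSD (the default)",
    6: "Assume a single uniform block of text",
    8: "Treat the image as a single word",
    11: "Sparse text: find as much text as possible in no particular order",
}

def describe_psm(mode=None):
    """Describe a --psm value; None means the flag is omitted (default 3)."""
    if mode is None:
        mode = 3
    if mode not in PSM_MODES:
        raise ValueError(f"mode {mode} is not in this shortened table")
    return f"--psm {mode}: {PSM_MODES[mode]}"

for mode in (None, 6, 8, 11):
    print(describe_psm(mode))
```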

Recommended OCR Settings

For scanned books or multi-column text, a safer approach is:

    pytesseract.image_to_string(image, config='--oem 1 -c preserve_interword_spaces=1')

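This leaves --psm unset, so Tesseract uses its default automatic page
segmentation instead of a forced layout assumption. --oem 1 selects the
LSTM engine, and preserve_interword_spaces=1 keeps the spacing between
words. Users can specify the language(s) as needed (e.g., -l eng, -l ara,
or -l ara+eng).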
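
Put together as a small script, the recommended call might look like the sketch below. It assumes pytesseract and Pillow are installed and that the requested language data is available; the function names are hypothetical, and pytesseract's `lang` argument is used in place of the `-l` command-line flag.

```python
RECOMMENDED_CONFIG = "--oem 1 -c preserve_interword_spaces=1"  # no --psm on purpose

def join_languages(languages):
    """Turn ['ara', 'eng'] into the 'ara+eng' form that Tesseract expects."""
    return "+".join(languages)

def ocr_scanned_spread(image_path, languages=("eng",)):
    """OCR a two-page scan with the settings above (hypothetical helper)."""
    # Imported lazily so join_languages works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path),
        lang=join_languages(languages),
        config=RECOMMENDED_CONFIG,
    )
```

A call such as `ocr_scanned_spread("spread.png", ["ara", "eng"])` would then process a mixed Arabic and English scan, where `spread.png` stands in for your own file.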

Conclusion

This discovery highlights why *experimentation is key* when working with
OCR. If your text output appears mixed or out of order, try *removing --psm*
and letting Tesseract handle the layout automatically. Hopefully, this
guide helps others facing similar issues!

Have you encountered other OCR challenges? Share your experience in the
comments or discussion forums!
