To improve the accuracy of text extraction, you can preprocess the image
before passing it to the OCR engine. Preprocessing techniques like
converting the image to grayscale, enhancing contrast, or applying filters
can help reduce noise and improve readability. Adjusting pytesseract's
configuration, for example trying a different --psm (page segmentation
mode) value, may also improve the results.
Here’s an updated version of your code with some preprocessing steps:

import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

pytesseract.pytesseract.tesseract_cmd = 'C:\\Users\\M562765\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.exe'

# Path to your image
image_path = 'C:/Users/M562765/Downloads/Unable-images/Unable/crop1.jpg'

def extract_text_from_image(image_path):
    # Open the image
    img = Image.open(image_path)
    # Convert the image to grayscale to improve text-background contrast
    img = img.convert('L')
    img = ImageEnhance.Contrast(img).enhance(2)  # Increase contrast
    img = img.filter(ImageFilter.SHARPEN)  # Sharpen the image
    # Use pytesseract to extract text; PSM 6 assumes a block of text
    extracted_text = pytesseract.image_to_string(img, config='--psm 6')
    return extracted_text.strip()

# Extract and print text
text = extract_text_from_image(image_path)
print(f"Text extracted from {image_path}: {text}")

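Since you mention that the image has a black background, it may also be worth inverting it before OCR. As far as I know, recent Tesseract versions tend to do better with dark text on a light background. Below is a rough sketch of that idea, not a tested fix for your image: the helper names (preprocess_for_ocr, ocr_image), the 2x upscale, and the threshold of 128 are my own assumptions and will probably need tuning.

```python
from PIL import Image, ImageOps, ImageStat


def preprocess_for_ocr(img, scale=2, threshold=128):
    """Return an upscaled, binarized, dark-on-light copy of img.

    Hypothetical helper: scale and threshold are guesses to tune per image.
    """
    gray = ImageOps.grayscale(img)
    # A low mean brightness suggests light text on a dark background,
    # so invert to get dark text on a light background.
    if ImageStat.Stat(gray).mean[0] < 128:
        gray = ImageOps.invert(gray)
    # Upscale small crops; very small glyphs are often misread.
    gray = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Simple global threshold: every pixel becomes pure black or pure white.
    return gray.point(lambda p: 255 if p > threshold else 0)


def ocr_image(image_path, psm=6):
    """Run Tesseract on the preprocessed image (needs pytesseract installed)."""
    import pytesseract  # imported here so the preprocessing works without it

    with Image.open(image_path) as img:
        prepared = preprocess_for_ocr(img)
    return pytesseract.image_to_string(prepared, config=f"--psm {psm}").strip()
```

You would call it as ocr_image(image_path), after setting tesseract_cmd as in the code above. If the text is a single line, it may be worth comparing psm=6 with psm=7 to see which gives the more complete output.
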
On Monday, 25 November 2024 at 4:12 PM, Taresh Chaudhari <
[email protected]> wrote:
> Attaching an image for reference.
>
> On Monday, 25 November 2024 at 15:52:27 UTC+5:30 Taresh Chaudhari wrote:
>
>> Hi,
>> I am trying to read the characters from an image in which the characters
>> are on a black background. I am attaching the code I used to extract
>> them; currently it gives only partial output. Can you help me make it
>> more accurate?
>>
>>
>> import pytesseract
>> from PIL import Image
>>
>> pytesseract.pytesseract.tesseract_cmd = 'C:\\Users\\M562765\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.exe'
>>
>> # Paths to your images
>> image_paths = [
>>     'C:/Users/M562765/Downloads/Unable-images/Unable/crop1.jpg',
>> ]
>>
>> # Function to process an image and extract text
>> def extract_text_from_image(image_path):
>>     # Open the image
>>     img = Image.open(image_path)
>>
>>     # Use pytesseract to perform OCR; PSM 6 assumes a block of text
>>     extracted_text = pytesseract.image_to_string(img, config='--psm 6')
>>     return extracted_text.strip()
>>
>> # Process all images and print results
>> for img_path in image_paths:
>>     text = extract_text_from_image(img_path)
>>     print(f"Text extracted from {img_path}: {text}")
>>
--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion visit
https://groups.google.com/d/msgid/tesseract-ocr/CAB5aXsmBTxkW%3DoaK4Jfp1HB7azg%2BOst3sDYQtRfwcm7EUMAQ%2Bw%40mail.gmail.com.