Scanned books?

I can't help with training or choosing datasets. But if these images are
photo-scanned book pages, did you run them through book-specific
processing software first? ScanTailor, spreads, and Bookscan Wizard are the
three I know of, plus the Internet Archive's scan tool scripts. These can
split your source images into a mixed raster type and enhance the text with
a thresholding algorithm. Thresholding (especially if you play around a bit
with the variables) can be extremely helpful if the lighting was a bit
uneven, or if other issues make it tough for Tesseract to see the pixels
that make up your letters as part of the letters.
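
To make the thresholding point concrete, here is a minimal sketch in plain
Python of local (adaptive) mean thresholding next to a single fixed cutoff.
It is not the algorithm any of the tools above actually use; the function
names, the synthetic test page, and the window/offset values are all made
up for illustration.

```python
def global_threshold(gray, cutoff=128):
    """Black (0) where a pixel is darker than one fixed cutoff, else white (255)."""
    return [[0 if v < cutoff else 255 for v in row] for row in gray]


def adaptive_threshold(gray, window=15, offset=20):
    """Black (0) where a pixel is darker than its local mean minus `offset`.

    `gray` is a list of rows of 0-255 values. The local mean is taken over a
    window x window neighbourhood, clipped at the image borders.
    """
    h, w = len(gray), len(gray[0])
    # Integral image with a zero border: integ[y][x] = sum of gray[0:y][0:x].
    integ = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integ[y + 1][x + 1] = integ[y][x + 1] + row_sum
    r = window // 2
    out = []
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        row = []
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            total = integ[y1][x1] - integ[y0][x1] - integ[y1][x0] + integ[y0][x0]
            mean = total / ((y1 - y0) * (x1 - x0))
            row.append(0 if gray[y][x] < mean - offset else 255)
        out.append(row)
    return out


def make_page(width=40, height=20):
    """Hypothetical test page: paper brightness falls from 200 (left) to 90
    (right), with three vertical ink strokes 60 levels darker than the paper."""
    page, ink = [], []
    for y in range(height):
        row, ink_row = [], []
        for x in range(width):
            paper = 200 - (110 * x) // (width - 1)
            is_ink = x in (5, 20, 35) and 5 <= y < 15
            row.append(paper - 60 if is_ink else paper)
            ink_row.append(is_ink)
        page.append(row)
        ink.append(ink_row)
    return page, ink


page, ink = make_page()
expected = [[0 if is_ink else 255 for is_ink in row] for row in ink]
global_out = global_threshold(page, cutoff=128)
adaptive_out = adaptive_threshold(page, window=15, offset=20)
print("global matches ink mask:  ", global_out == expected)
print("adaptive matches ink mask:", adaptive_out == expected)
```

On this toy page the fixed cutoff turns the dim right-hand side solid black,
so the stroke there disappears into the background, while the local version
recovers all three strokes. The window size and offset are the "variables"
worth playing with on real scans. OpenCV's cv2.adaptiveThreshold does the
same kind of thing on real images.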

On Thu, Apr 18, 2024, 10:52 testcoal <testcoal...@gmail.com> wrote:

> Hi,
> I've been utilizing Tesseract 4 to extract text from PNG and TIFF images,
> and all the content is in German. While the image quality is pretty decent,
> the extraction results have been less than stellar for some of them. I
> understand that to improve OCR accuracy, training Tesseract with additional
> data is recommended.
>
> However, I've hit a roadblock as I only have the images without the exact
> text (ground truth) or bounding boxes. Creating this data manually seems
> like a massive undertaking—do you recommend this as the best course of
> action? Or, are there other solutions or perhaps existing prepared datasets
> for German that I could use?
>
> Also, I'm curious about the volume of training data required. Is there a
> minimum number of images and corresponding texts that you'd consider
> sufficient to start seeing improved results?
>
> Any guidance or resources you can provide would be greatly appreciated.
>
> Atef
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/aeed1be3-e759-454f-89b5-ff3f0282d9a8n%40googlegroups.com
> .
>

