Scanned books? I can't help with training or choosing datasets, but if these images are photographed or scanned book pages, did you run them through book-specific processing software first? ScanTailor, spreads, and Bookscan Wizard are the three I know of, plus the Internet Archive's scan tool scripts. They can split your source images into a mixed raster output and enhance the text with a thresholding algorithm. Thresholding (especially if you play around a bit with the parameters) can be extremely helpful if the lighting was a bit uneven, or if other issues make it hard for Tesseract to recognize the pixels that make up your letters as part of the letters.
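To make the thresholding point concrete, here is a rough sketch in plain Python of a local-mean (Bradley-style) adaptive threshold. This is not what ScanTailor or the other tools do internally, only an illustration of why a per-neighbourhood threshold copes with uneven lighting where a single global cutoff does not. The function names, the synthetic test page, and the `window` and `offset` values are all made up for the example.

```python
def adaptive_threshold(gray, window=15, offset=10):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel becomes black (0) when it is more than `offset` darker than
    the mean of its window x window neighbourhood, otherwise white (255).
    """
    h, w = len(gray), len(gray[0])
    # Integral image with a zero border: integ[y][x] = sum of gray[:y][:x].
    integ = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integ[y + 1][x + 1] = integ[y][x + 1] + row_sum

    r = window // 2
    out = []
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        row = []
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            area = (y1 - y0) * (x1 - x0)
            total = integ[y1][x1] - integ[y0][x1] - integ[y1][x0] + integ[y0][x0]
            local_mean = total / area
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        out.append(row)
    return out


def make_page(width=40, height=9):
    """Synthetic page (an assumption for this demo): the background fades
    from bright on the left to dim on the right, with one row of dark
    "ink" dots that are always 70 levels darker than the paper."""
    page = []
    for y in range(height):
        row = []
        for x in range(width):
            background = 220 - 3 * x
            is_ink = (y == height // 2 and x % 4 == 0)
            row.append(background - 70 if is_ink else background)
        page.append(row)
    return page


page = make_page()
binary = adaptive_threshold(page, window=7, offset=20)

# Compare with a single global cutoff at 128, which also turns the
# dim right-hand side of the paper black.
global_black = sum(v < 128 for row in page for v in row)
adaptive_black = sum(v == 0 for row in binary for v in row)
print("black pixels, global cutoff:", global_black)
print("black pixels, adaptive:     ", adaptive_black)
for row in binary:
    print("".join("#" if v == 0 else "." for v in row))
```

On real scans you would normally let the book-processing tool or an image library do this and just tune its parameters; the point of the sketch is only that the window size and offset are the knobs worth playing with.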
On Thu, Apr 18, 2024, 10:52 testcoal <testcoal...@gmail.com> wrote:

> Hi,
> I've been utilizing Tesseract 4 to extract text from PNG and TIFF images,
> and all the content is in German. While the image quality is pretty decent,
> the extraction results have been less than stellar for some of them. I
> understand that to improve OCR accuracy, training Tesseract with additional
> data is recommended.
>
> However, I've hit a roadblock as I only have the images without the exact
> text (ground truth) or bounding boxes. Creating this data manually seems
> like a massive undertaking. Do you recommend this as the best course of
> action? Or, are there other solutions or perhaps existing prepared datasets
> for German that I could use?
>
> Also, I'm curious about the volume of training data required. Is there a
> minimum number of images and corresponding texts that you'd consider
> sufficient to start seeing improved results?
>
> Any guidance or resources you can provide would be greatly appreciated.
>
> Atef
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/aeed1be3-e759-454f-89b5-ff3f0282d9a8n%40googlegroups.com