Re: [tesseract-ocr] Detected Malware latest Windows release | VirusTotal

2024-07-02 Thread Misti Hamon
I'm not one of the developers, and I don't do anything for Windows anymore (issues like this are one of the reasons). Reading through your link, it sounds like there isn't anything actually malicious, just that one of the certificates has expired (or is near expiring and your system clock is off in some way -

Re: [tesseract-ocr] Re: Manual review and correction for characters outside of the Latin-1 character set

2024-06-09 Thread Misti Hamon
Ger, your problem set/end goal is similar to mine (textbooks/manuals rather than magazines and datasheets, and I only have TIFF or JPG images, no partial PDFs, but full-text search and copy/paste are things I want, and textbooks/manuals do have the same OCR difficulties as magazines). Can't offer much

Re: [tesseract-ocr] Re: Manual review and correction for characters outside of the Latin-1 character set

2024-06-07 Thread Misti Hamon
With novels and non-fiction prose (memoirs, basic history or whatever) I'm getting good runs; they also happen to use fonts that were already trained, or are close to ones that were. Manuals and textbooks - most of the ones I'm trying to work with include pictures and diagrams and other elements to further

Re: [tesseract-ocr] Re: Manual review and correction for characters outside of the Latin-1 character set

2024-06-07 Thread Misti Hamon
Hello Ger, and thank you for responding. Regarding training and/or tuning - I definitely don't have the available computing power for a full train, and assuming I'm understanding the requirements (specifically the 1000 images minimum thing) I'm not sure I have enough data for a tune (it's

Re: [tesseract-ocr] Using Tesseract as an OCR solution for blind people

2024-04-30 Thread Misti Hamon
Image quality matters. Upside-down or sideways images really need to be rotated first - that is easy to do without loading up an image editor, you just need to get into the JPG's metadata. It sounds like you are processing textbooks, to turn into something a screen reader can manage? Headers and such
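A minimal sketch of the rotation point above, assuming the "metadata" in question is the standard EXIF Orientation tag (0x0112); the message does not name the tag or a tool. The mapping and the function name `clockwise_rotation_needed` are illustrative, not part of Tesseract, and only the four unmirrored orientation values are covered.

```python
# EXIF Orientation values for unmirrored images, mapped to the
# clockwise rotation (in degrees) needed to display the page upright.
# Assumption: the JPG carries a standard EXIF Orientation tag.
ROTATION_FOR_ORIENTATION = {
    1: 0,    # already upright
    3: 180,  # upside down
    6: 90,   # needs a quarter turn clockwise
    8: 270,  # needs a quarter turn counter-clockwise
}


def clockwise_rotation_needed(orientation):
    """Return degrees of clockwise rotation, or None for mirrored or unknown values."""
    return ROTATION_FOR_ORIENTATION.get(orientation)


for value in (1, 3, 6, 8):
    print(value, clockwise_rotation_needed(value))
```

Whether a given OCR pipeline honours this tag on its own is not stated in the thread, so it is safer to check the tag and rotate the pixels (or rewrite the tag) before recognition.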

Re: [tesseract-ocr] Re: Textbook-like format. Correcting improperly recognized text

2024-04-29 Thread Misti Hamon
"Regarding proofreading with Scribe OCR, it is definitely possible to zoom in. The controls are virtually identical to popular document viewer programs like Acrobat. You can zoom in on the current location of the mouse using Control + Mouse Wheel, scroll using the mouse wheel, and pan in all

[tesseract-ocr] Textbook-like format. Correcting improperly recognized text

2024-04-29 Thread Misti Hamon
Forgive me, I have lots of questions and will be trying to separate out one question per conversation (so that those searching later may more easily find the answers). I'm working with scanned images of a textbook-like layout - occasional drop caps, text in 2 or occasionally 3 columns that

Re: [tesseract-ocr] hOCR verification and editing plus non-word characters

2024-04-29 Thread Misti Hamon
e: if I were you I'd want to see both processes' performance and decide what to do after that. Postprocessing is akin to "fixing it in the mix": you only do that when all other options have been depleted. On Sun, 24 Mar 2024, 19:29 Misti Hamon, wrote:

Re: [tesseract-ocr] Cannot run Tesseract

2024-04-27 Thread Misti Hamon
The words imagename, outputbase, lang, etc. are all placeholders (and anything in square brackets is optional; you don't need to include it if the defaults will give you the output you need). To run tesseract, you'll need to replace the placeholder text with the specifics of your file. On Sat,
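To make the placeholder substitution concrete, here is a small sketch that fills in imagename, outputbase and the optional lang and prints the resulting command line. The file names `page_001.tif` and `page_001` and the helper `build_tesseract_command` are hypothetical; only the `tesseract imagename outputbase [-l lang]` shape comes from the usage text being discussed.

```python
import shlex


def build_tesseract_command(imagename, outputbase, lang=None):
    """Return the argument list for a basic tesseract run.

    imagename and outputbase are required; lang is optional, matching
    the square brackets in the usage text.
    """
    command = ["tesseract", imagename, outputbase]
    if lang is not None:
        command += ["-l", lang]
    return command


# Hypothetical file names standing in for your own scan.
cmd = build_tesseract_command("page_001.tif", "page_001", lang="eng")
print(shlex.join(cmd))  # tesseract page_001.tif page_001 -l eng
```

With Tesseract installed and on the PATH, the same list could be handed to `subprocess.run(cmd, check=True)`; leaving `lang` out simply uses the default language.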

Re: [tesseract-ocr] Train Tesseract (german)

2024-04-18 Thread Misti Hamon
Scanned books? I can't help with training or choosing datasets, but if these images are photoscanned book pages, did you run the images through book-specific processing software (scantailor, spreads, or bookscan wizard are the 3 I know of, plus Internet Archive's scan tool scripts) to split your source

[tesseract-ocr] hOCR verification and editing plus non-word characters

2024-03-24 Thread Misti Hamon
I'll preface this by saying I haven't actually done an OCR run yet (by the time any replies come in, I probably will have; the source image editing is almost done). I'm working with some photoscanned images of knitting-related work (so, there are some non-word characters and acronyms used,