Thank you for your reply, and please forgive my delay. Preprocessing my images took much longer than I anticipated (or rather, than I was led to believe it would, probably because I'm working with a textbook-type layout and not a novel-type layout right now).
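A minimal sketch of the two-run comparison asked about just below, assuming the stock tesseract CLI is on the PATH and the installed traineddata includes the legacy model (--oem 0 is the legacy engine only, --oem 1 is LSTM only; the tessdata_fast and tessdata_best sets are LSTM-only, so --oem 0 will not work with them). The helper names (tesseract_cmd, HocrWords, low_confidence, run_both), the output base names, and the threshold of 60 are all made up for illustration; nothing here comes from the thread itself. It only flags low-confidence words in each hOCR file so they can be checked by eye.

```python
import re
import subprocess
from html.parser import HTMLParser


def tesseract_cmd(image, out_base, oem, lang="eng"):
    """Build a tesseract call that writes <out_base>.hocr (hypothetical helper)."""
    return ["tesseract", image, out_base, "--oem", str(oem), "-l", lang, "hocr"]


class HocrWords(HTMLParser):
    """Collect (text, confidence) for every ocrx_word span in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._conf = None  # confidence of the word span we are inside, if any
        self._depth = 0    # nested <span> depth inside the current word
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._conf is not None:
            self._depth += 1  # e.g. per-character spans inside a word
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or ""):
            m = re.search(r"x_wconf\s+([\d.]+)", attrs.get("title") or "")
            self._conf = float(m.group(1)) if m else -1.0
            self._text = []

    def handle_data(self, data):
        if self._conf is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._conf is None:
            return
        if self._depth:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._conf))
        self._conf = None


def low_confidence(hocr_text, threshold=60):
    """Return the words whose x_wconf is below an (arbitrary) threshold."""
    parser = HocrWords()
    parser.feed(hocr_text)
    return [(w, c) for w, c in parser.words if c < threshold]


def run_both(image):
    """Run the legacy and LSTM engines separately and report doubtful words."""
    report = {}
    for oem in (0, 1):
        base = f"page_oem{oem}"  # assumed output name, not from the thread
        subprocess.run(tesseract_cmd(image, base, oem), check=True)
        with open(base + ".hocr", encoding="utf-8") as fh:
            report[oem] = low_confidence(fh.read())
    return report
```

Calling run_both("page.png") (with a placeholder file name) would leave page_oem0.hocr and page_oem1.hocr on disk for side-by-side inspection. Keeping the runs separate, rather than using the combined engine mode, follows the suggestion quoted below of looking at both outputs before deciding anything.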
To confirm, you are suggesting a run with --oem 0 set and a second run with --oem 1 set and then compare the results, correct?

On Mon, Mar 25, 2024, 05:50 Ger Hobbelt <g...@hobbelt.com> wrote:

> In your scenario, I would check performance of both modern lstm (v4/v5
> engine) and old "classic" v3 OCR engine in tesseract. Just for completeness
> sake; first tests would be in separate runs so I'ld be able to check the
> output quality of both runs into HOCR format. (2 separate runs so I don't
> have to bother within tesseract internal heuristic to "pick the best one"
> and only dump that one: if I were you I'ld want to see both processes'
> performance and decide what to do after that.
>
> Postprocessing is akin to "fixing it in the mix": you only do that when
> all other options have been depleted.
>
>
> On Sun, 24 Mar 2024, 19:29 Misti Hamon, <mistiha...@gmail.com> wrote:
>
>> I'm going to preface this with, I haven't actually done an OCR run yet
>> (by the time any replies come in, I probably will have, the source image
>> editing is almost done).
>>
>> I'm working with some photoscanned images of knitting related work (so,
>> there are some non-word characters and acronyms used, most are still
>> English but there are occasional symbols, some standard ascii or unicode,
>> others specialty - I should be able to exclude the specialty symbols and
>> keep them as an image, or at least I hope so), based on tesseract being a
>> "groups of words" based recognition, it sounds like this might produce
>> unexpected results? (example of a line that might show up that could
>> cause a problem would be - K2, yo, k2tog, k to last 4, ssk, yo, k2 -
>> doesn't look like English words, kind of looks like a sentence *if* you
>> assume a space or comma denotes a that which came before is a word)
>>
>> So, in order to handle/fix stuff like that, without training, I'm looking
>> for tips on how to inspect my hOCR files to verify and, if necessary,
>> correct the results, that work on linux without running wine. I am looking
>> into the tools suggested in the "Post OCR Verification and Editing"
>> conversation, but that poster is on windows, with a different toolchain,
>> so, not sure all apply to me.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAEnOb6TcNya959P2V124HR4h_Aj32D%3Dx5qA8Xxc1vPByTw6xmg%40mail.gmail.com.