Re: [tesseract-ocr] hOCR verification and editing plus non-word characters

Misti Hamon Mon, 29 Apr 2024 09:53:10 -0700

Thank you for your reply, and please forgive my delay. Preprocessing my
images took much longer than I anticipated (or rather, than I was led to
believe it would, probably because I'm working with a textbook-style layout
and not a novel-style layout right now).


To confirm: you are suggesting one run with --oem 0 set, a second run with
--oem 1 set, and then comparing the results, correct?
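
For reference, a minimal sketch of what that two-run comparison could look
like, assuming both runs write hOCR via tesseract's "hocr" config. The helper
names, the file names, the confidence threshold and the sample fragments are
all made up for illustration; none of them come from tesseract itself. One
known catch: --oem 0 only works with a traineddata file that still contains
the legacy model (the plain "tessdata" set; "tessdata_fast" and
"tessdata_best" are LSTM-only).

```python
import difflib
import re
from html.parser import HTMLParser


def build_command(image, out_base, oem):
    """Return argv for one tesseract run that writes <out_base>.hocr."""
    return ["tesseract", image, out_base, "--oem", str(oem), "hocr"]


class HocrWords(HTMLParser):
    """Collect (text, confidence) for every ocrx_word span in an hOCR file."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0   # span nesting depth inside the current word
        self._text = []
        self._conf = None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:           # nested span inside a word
            self._depth += 1
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._depth = 1
            self._text = []
            m = re.search(r"x_wconf\s+(\d+(?:\.\d+)?)", attrs.get("title") or "")
            self._conf = float(m.group(1)) if m else None

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._text).strip(), self._conf))


def parse_hocr(text):
    """Return the list of (word, confidence) pairs found in hOCR text."""
    parser = HocrWords()
    parser.feed(text)
    parser.close()
    return parser.words


def differing_words(words_a, words_b):
    """Return (run A words, run B words) for each span where the runs disagree."""
    a = [w for w, _ in words_a]
    b = [w for w, _ in words_b]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]


def low_confidence(words, threshold=70):
    """Return words whose confidence is below an arbitrary, hypothetical threshold."""
    return [w for w, conf in words if conf is not None and conf < threshold]


# Made-up hOCR fragments standing in for the two runs (not real OCR output).
SAMPLE_OEM0 = (
    "<span class='ocrx_word' title='bbox 0 0 10 10; x_wconf 91'>K2,</span> "
    "<span class='ocrx_word' title='bbox 12 0 20 10; x_wconf 60'>yo,</span> "
    "<span class='ocrx_word' title='bbox 22 0 40 10; x_wconf 42'>k2tog,</span>"
)
SAMPLE_OEM1 = SAMPLE_OEM0.replace("k2tog,", "k2to9,")

if __name__ == "__main__":
    print(" ".join(build_command("page.png", "page_oem0", 0)))
    print(" ".join(build_command("page.png", "page_oem1", 1)))
    print(differing_words(parse_hocr(SAMPLE_OEM0), parse_hocr(SAMPLE_OEM1)))
    print(low_confidence(parse_hocr(SAMPLE_OEM0)))
```

In real use, each list from build_command would be handed to subprocess.run,
and the resulting .hocr files read and passed to parse_hocr. The spans where
the two engines disagree, plus the low-confidence words, are the places worth
checking by eye first.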

On Mon, Mar 25, 2024, 05:50 Ger Hobbelt <g...@hobbelt.com> wrote:

> In your scenario, I would check the performance of both the modern LSTM
> (v4/v5) engine and the old "classic" v3 OCR engine in tesseract, just for
> completeness' sake. The first tests would be separate runs, so I'd be able
> to check the output quality of each run in hOCR format. (Two separate runs,
> so I don't have to bother with tesseract's internal heuristic that "picks
> the best one" and only dumps that one. If I were you, I'd want to see both
> processes' performance and decide what to do after that.)
>
> Postprocessing is akin to "fixing it in the mix": you only do that when
> all other options have been exhausted.
>
>
> On Sun, 24 Mar 2024, 19:29 Misti Hamon, <mistiha...@gmail.com> wrote:
>
>> I'm going to preface this by saying that I haven't actually done an OCR
>> run yet (by the time any replies come in, I probably will have; the
>> source image editing is almost done).
>>
>> I'm working with some photoscanned images of knitting-related work. There
>> are some non-word characters and acronyms; most of the text is still
>> English, but there are occasional symbols, some standard ASCII or Unicode,
>> others specialty. (I should be able to exclude the specialty symbols and
>> keep them as images, or at least I hope so.) Since tesseract's recognition
>> is based on "groups of words", it sounds like this might produce
>> unexpected results? An example of a line that might show up and cause a
>> problem:
>>
>>     K2, yo, k2tog, k to last 4, ssk, yo, k2
>>
>> It doesn't look like English words, but it kind of looks like a sentence
>> *if* you assume a space or comma means that whatever came before it is a
>> word.
>>
>> So, in order to handle or fix things like that without training, I'm
>> looking for tips on how to inspect my hOCR files to verify and, if
>> necessary, correct the results, using tools that work on Linux without
>> running Wine. I am looking into the tools suggested in the "Post OCR
>> Verification and Editing" conversation, but that poster is on Windows with
>> a different toolchain, so I'm not sure they all apply to me.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/6a64a68e-c3d5-4878-8c74-37be419c54d8n%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/6a64a68e-c3d5-4878-8c74-37be419c54d8n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>

