Hi Abilash Mathew,

I haven't measured per-character (or per-word) accuracy, but from spot-checks 
things look fair (roughly 80-90% per character?).  Obviously the accuracy 
depends on the quality of the input, and sometimes you get a lousy 300 dpi scan 
made from a 200 dpi printed fax ...
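
For anyone who wants a number rather than a spot-check, here is a minimal 
sketch of one common way to score per-character accuracy: one minus the edit 
distance between a hand-typed reference and the OCR output, divided by the 
reference length.  The function names are made up for illustration, and this 
is not the method used to arrive at any figure quoted in this thread.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Per-character accuracy relative to the reference text.
    Clamped at 0.0 because heavy insertion noise can exceed the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # One substituted character out of seven.
    print(f"{char_accuracy('patient', 'patlent'):.3f}")
```

A per-word figure can be had the same way by comparing lists of tokens instead 
of strings of characters.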

I do run PDFs through ImageMagick first, but that is more to get pagination 
than anything else, as sometimes an entire patient history is in one giant PDF.
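
As a rough illustration of that pagination step, the sketch below builds an 
ImageMagick command that rasterises a multi-page PDF into one image per page.  
The file names, the 300 dpi density and the helper name are assumptions, not 
the actual settings used here; ImageMagick needs Ghostscript installed to read 
PDFs, and ImageMagick 7 installs the program as `magick` rather than `convert`.

```python
import os
import subprocess


def build_split_command(pdf_path: str, out_dir: str, dpi: int = 300) -> list:
    """Return an ImageMagick command that writes one PNG per PDF page.

    -density is given before the input file so that it controls the
    resolution at which the PDF is rasterised.  The %04d in the output
    name is replaced by ImageMagick with the page index.
    """
    out_pattern = os.path.join(out_dir, "page-%04d.png")
    return ["convert", "-density", str(dpi), pdf_path, out_pattern]


def split_pdf(pdf_path: str, out_dir: str, dpi: int = 300) -> None:
    """Run the command; raises CalledProcessError if ImageMagick fails."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(build_split_command(pdf_path, out_dir, dpi), check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(build_split_command("history.pdf", "pages")))
```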

I am not an expert on OCR or on any particular tool or method.  There is a lot 
of Tesseract discussion on the web if you look for it.

Sean

-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Thursday, October 19, 2017 1:20 AM
To: [email protected]
Subject: RE: OCR engine used [EXTERNAL]

Sean,

What accuracy do you get from OCR? We are at 60-70% accuracy. Most of the 
documents are 200 DPI ones. Also, are you using any other software, like 
Matlab, for OCR pre- or post-processing?

Thanks,
Abilash Mathew

-----Original Message-----
From: Mathew, Abilash (Cognizant)
Sent: Monday, October 16, 2017 8:37 PM
To: [email protected]
Subject: RE: OCR engine used [EXTERNAL]

Thanks Sean for the quick reply and for providing the valuable information.

Regards,
Abilash Mathew

-----Original Message-----
From: Finan, Sean [mailto:[email protected]]
Sent: Monday, October 16, 2017 8:17 PM
To: [email protected]
Subject: RE: OCR engine used [EXTERNAL]

Hi Abilash Mathew,

I have only used Tesseract.  Unfortunately, no OCR is perfect.
I am by no means an expert on Tesseract, but perhaps I can help you get 
started ...

There are tricks that you can use to get it to work better with medical notes 
(besides training on fonts).  Possibly the most effective is restricting the 
output to a whitelist of desired characters: set tessedit_char_whitelist to a 
series of characters that leaves out things like hash, dollar, bar ...  Another 
is to add a wordlist that contains words pertinent to your domain.  See:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#dictionaries-word-lists-and-patterns
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data
https://stackoverflow.com/questions/9568165/custom-dictionary-for-tesseract
https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg10100.html
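
To make the two tricks concrete, here is a minimal sketch that assembles a 
Tesseract command line with a character whitelist and a user word list.  The 
file names, the exact whitelist and the helper name are assumptions for 
illustration only.  `-c` and `--user-words` are options of the tesseract 
command line, but whether the whitelist is honoured can depend on the 
Tesseract version and recognition engine, so check it against your install.

```python
import string

# Letters, digits and a little punctuation common in clinical notes.
# Hash, dollar and bar are deliberately left out.
WHITELIST = string.ascii_letters + string.digits + ".,:;/()%-+"


def build_tesseract_command(image_path, output_base,
                            whitelist=WHITELIST, user_words_path=None):
    """Return the argument list for one tesseract run.

    Tesseract writes its text to <output_base>.txt.  user_words_path, if
    given, points to a plain-text file with one domain word per line.
    """
    cmd = ["tesseract", image_path, output_base]
    if user_words_path:
        cmd += ["--user-words", user_words_path]
    cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(build_tesseract_command("page-0001.png", "page-0001",
                                  user_words_path="medical_words.txt"))
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting 
problems with the punctuation in the whitelist.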

Good luck,
Sean

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Monday, October 16, 2017 10:13 AM
To: [email protected]
Subject: OCR engine used [EXTERNAL]

Hi All,

Can you suggest some OCR engines used for medical record text extraction from 
images? I am currently using Tesseract and seeing some text extraction quality 
issues.

Thanks,
Abilash Mathew
This e-mail and any files transmitted with it are for the sole use of the 
intended recipient(s) and may contain confidential and privileged information. 
If you are not the intended recipient(s), please reply to the sender and 
destroy all copies of the original message. Any unauthorized review, use, 
disclosure, dissemination, forwarding, printing or copying of this email, 
and/or any action taken in reliance on the contents of this e-mail is strictly 
prohibited and may be unlawful. Where permitted by applicable law, this e-mail 
and other e-mail communications sent to and from Cognizant e-mail addresses may 
be monitored.