Re: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-16 Thread Charlie Hull
My colleagues Eric Pugh and Dan Worley covered OCR and Solr in a presentation at our recent London Lucene/Solr Meetup: https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/264579498/ (direct link to slides if you can't find it in the comments https://www.slideshare.net/o19s/payload

RE: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-14 Thread Retro
Hello, thanks for the answer, but let me explain the setup. We are running our own backup solution for emails (messages from Exchange in MSG format). The content of these messages is then indexed in Solr. But Solr cannot process attachments within those MSG files and cannot OCR them. This is what I need - to O

RE: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-11 Thread Davis, Daniel (NIH/NLM) [C]
> AJ Weber wrote > > There are alternative, paid, libraries to parse and extract attachments > > from EML files as well > > EML attachments will have a mimetype associated with their metadat

Re: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-11 Thread Retro
AJ Weber wrote > There are alternative, paid, libraries to parse and extract attachments > from EML files as well > EML attachments will have a mimetype associated with their metadata. Hello, can you give a hint as to which commercial libraries would do the job? We need to index MSG files

Re: Using Tesseract OCR to extract PDF files in EML file attachment

2017-04-04 Thread AJ Weber
You'll need to use something like javax mail (or some of the jars that have been built on top of it for higher-level access) to open the EML files and extract the attachments, then operate on the extracted attachments as you would any file. There are alternative, paid, libraries to parse and e
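
The approach described above (open the EML, pull out the attachments, then treat each one as an ordinary file) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's standard-library email package rather than javax mail, and the names extract_attachments and build_sample_eml, the sample addresses, and the placeholder PDF bytes are all hypothetical. Filtering relies on the MIME type each attachment carries, as mentioned elsewhere in the thread.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def extract_attachments(eml_bytes, wanted_types=("application/pdf",)):
    """Return (filename, content_type, payload_bytes) for matching attachments.

    Hypothetical helper: walks every MIME part of the message and keeps the
    ones marked as attachments whose MIME type is in wanted_types.
    """
    msg = BytesParser(policy=policy.default).parsebytes(eml_bytes)
    found = []
    for part in msg.walk():
        if part.get_content_disposition() != "attachment":
            continue
        ctype = part.get_content_type()
        if ctype in wanted_types:
            # decode=True undoes the base64 / quoted-printable transfer encoding
            found.append((part.get_filename(), ctype, part.get_payload(decode=True)))
    return found


def build_sample_eml():
    """Build a small in-memory EML with one fake PDF attachment (demo data only)."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = "archive@example.com"
    msg["Subject"] = "Scanned invoice"
    msg.set_content("See attached scan.")
    msg.add_attachment(
        b"%PDF-1.4 placeholder",  # not a real PDF, just placeholder bytes
        maintype="application",
        subtype="pdf",
        filename="scan.pdf",
    )
    return msg.as_bytes()


if __name__ == "__main__":
    for name, ctype, data in extract_attachments(build_sample_eml()):
        print(name, ctype, len(data))
```

The extracted bytes could then be written to a temporary file and handed to whatever OCR and indexing path is already used for standalone PDFs. MSG files are a different format and would need a separate parser; this sketch covers EML only.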

Re: Using Tesseract OCR to extract PDF files in EML file attachment

2017-04-03 Thread Rick Leir
Tesseract probably knows nothing of the EML format. Your scripts could pull EMLs apart. On April 4, 2017 2:00:19 AM EDT, Zheng Lin Edwin Yeo wrote: >Hi, > >Currently, I am able to extract scanned PDF images and index them to >Solr >using Tesseract OCR, although the speed is very slow. > >However