Re: [opensuse] PDF OCR
On Thursday 13 December 2007, StephenW wrote:
> --- Roger Oberholtzer wrote:
> > Hello
> >
> > We have a network printer that will scan docs and send them as pdf
> > docs to an e-mail address in the company. Is there any software with
> > OpenSUSE 10.3 that can do OCR from a PDF doc? I am guessing that the
> > doc contains tiff images of the scanned documents.
> >
> > Any and all pointers are welcome.

I had to do much the same in the past - a quick bash script seemed like
the best way to solve it:

1. use pdftoppm to extract the images from the pdf to a new directory
2. use ppm2tiff on all the extracted ppm files
3. use tesseract (or whatever it's called these days) on the tiff files
4. append the text files to a single text file (or leave them separate,
   whatever)

There's probably a much more sensible way of doing this :-) but this
worked consistently for me for quite a number of documents scanned and
sent as pdf.

Ciaran
--
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nuremberg)
Maxfeldstrasse 5, 90409 Nuremberg
Tel: +49 911 74053 262
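The four steps in this message can be sketched as a small shell function. This is only a sketch under assumptions, not Ciaran's actual script: the function name ocr_pdf, the 300 dpi resolution, and the page-N file naming are illustrative choices, and it assumes pdftoppm (xpdf or poppler), ppm2tiff (libtiff tools) and tesseract are installed.

```shell
# Sketch: OCR a scanned PDF via pdftoppm -> ppm2tiff -> tesseract.
# ocr_pdf is a hypothetical name; 300 dpi is an assumed resolution.
ocr_pdf() {
    local pdf="$1" out="$2" workdir page base status=0

    # Work in a fresh temporary directory.
    workdir=$(mktemp -d) || return 1

    # Step 1: render every page of the PDF to page-<N>.ppm.
    if pdftoppm -r 300 "$pdf" "$workdir/page"; then
        : > "$out"    # start with an empty output file
        # pdftoppm zero-pads page numbers, so glob order should match
        # page order.
        for page in "$workdir"/page-*.ppm; do
            [ -e "$page" ] || continue
            base="${page%.ppm}"
            # Step 2: PPM -> TIFF.
            # Step 3: OCR the TIFF; tesseract writes <base>.txt.
            # Step 4: append the page text to the combined file.
            ppm2tiff "$page" "$base.tif" &&
                tesseract "$base.tif" "$base" &&
                cat "$base.txt" >> "$out" || { status=1; break; }
        done
    else
        status=1
    fi

    rm -rf "$workdir"
    return "$status"
}
```

It would be called as "ocr_pdf scan.pdf scan.txt". Leaving the per-page text files separate, as the message also suggests, would only mean copying them out of the temporary directory instead of appending them.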
Re: [opensuse] PDF OCR
On Dec 13, 2007 2:18 AM, Roger Oberholtzer wrote:
> On Wed, 2007-12-12 at 13:46 -0800, Kai Ponte wrote:
> > Here's a how-to on how to do PDF to text, though I've yet to be able
> > to convert PDF to TIFF yet...
>
> From ImageMagick:
>
>     convert x.pdf x.tiff
>
> Or to any format ImageMagick supports, not just tiff. I am not sure how
> it (or gimp, as was also suggested) works with multiple pages.

I just searched the archives and found that I started a similar thread
back in April. Apparently I tried convert from ImageMagick and found it
lacking. My problem was that I wanted greyscale tiffs and I was only
able to get black-and-white.

I used the pdftoppm / ppm2tiff approach.

Greg
--
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
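For the greyscale case described here, pdftoppm has a -gray option that writes PGM files, which ppm2tiff also accepts. The function below is a minimal sketch along those lines, not the script Greg used: the name pdf_to_grey_tiffs, the 300 dpi default, and the output layout are assumptions made for illustration.

```shell
# Sketch: render a PDF to one greyscale TIFF per page.
# pdf_to_grey_tiffs is a hypothetical name; 300 dpi is an assumed default.
pdf_to_grey_tiffs() {
    local pdf="$1" outdir="$2" dpi="${3:-300}" pgm

    mkdir -p "$outdir" || return 1

    # -gray makes pdftoppm write greyscale PGM files (page-<N>.pgm).
    pdftoppm -gray -r "$dpi" "$pdf" "$outdir/page" || return 1

    for pgm in "$outdir"/page-*.pgm; do
        [ -e "$pgm" ] || continue
        # ppm2tiff reads PGM as well as PPM; drop the PGM once converted.
        ppm2tiff "$pgm" "${pgm%.pgm}.tif" && rm -f "$pgm"
    done
}
```

Called as "pdf_to_grey_tiffs scan.pdf tiffs 200", it would leave tiffs/page-<N>.tif files rendered at 200 dpi. With ImageMagick, passing -density before the input file (for example "convert -density 300 x.pdf x.tiff") raises the rendering resolution above its low default, which may help with the quality problems; whether it fixes the greyscale issue reported here is not something this thread confirms.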
Re: [opensuse] PDF OCR
On Thu, 2007-12-13 at 08:32 -0500, Greg Freemyer wrote:
> Apparently I tried convert from ImageMagick and found it lacking. My
> problem was I wanted greyscale tiffs and I was only able to get BW.

I can get color and all, but the quality is terrible compared to what I
see in acroread. Same with gimp. pdftoppm does indeed seem to preserve
what is in the pdf. I wonder why gimp and convert are so bad by
comparison. In fact, they seem to be equally bad, so they must share
some tool that is the root cause. But I can be happy with pdftoppm.

> I used the pdftoppm / ppm2tiff approach.

--
Roger Oberholtzer

OPQ Systems / Ramböll RST

Ramböll Sverige AB
Kapellgränd 7
P.O. Box 4205
SE-102 65 Stockholm, Sweden

Office: Int +46 8-615 60 20
Mobile: Int +46 70-815 1696
Re: [opensuse] PDF OCR
On Dec 13, 07 09:19:11 +0100, Ciaran Farrell wrote:
> On Thursday 13 December 2007, StephenW wrote:
> > --- Roger Oberholtzer wrote:
> > > Hello
> > >
> > > We have a network printer that will scan docs and send them as pdf
> > > docs to an e-mail address in the company. Is there any software
> > > with OpenSUSE 10.3 that can do OCR from a PDF doc? I am guessing
> > > that the doc contains tiff images of the scanned documents.
> > >
> > > Any and all pointers are welcome.
>
> I had to do much the same in the past - a quick bash script seemed
> like the best way to solve it:
>
> 1. use pdftoppm to extract the images from the pdf to a new directory
> 2. use ppm2tiff on all the extracted ppm files
> 3. use tesseract (or whatever it's called these days) on the tiff files
> 4. append the text files to a single text file (or leave them
>    separate, whatever)
>
> There's probably a much more sensible way of doing this :-) but this
> worked consistently for me for quite a number of documents scanned and
> sent as pdf.

This is already the best approach, afaik.

I assume ocropus helps with layout issues like multicolumn and such.
Any volunteers who want to try out ocropus? I see rpm packages in
http://download.opensuse.org/repositories/home:/StefanBruens

cheers, Jw.

--
Juergen Weigert
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nuernberg)
Re: [opensuse] PDF OCR
Roger Oberholtzer pecked at the keyboard and wrote:
> Hello
>
> We have a network printer that will scan docs and send them as pdf
> docs to an e-mail address in the company. Is there any software with
> OpenSUSE 10.3 that can do OCR from a PDF doc? I am guessing that the
> doc contains tiff images of the scanned documents.
>
> Any and all pointers are welcome.

Have you tried pdftotext?

pc5:~ # pdftotext -h
pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -layout           : maintain original physical layout
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -cfg <string>     : configuration file to use in place of .xpdfrc
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information

--
Ken Schneider
SuSe since Version 5.2, June 1998
Re: [opensuse] PDF OCR
On Wednesday 2007-12-12 at 13:52 -0500, Ken Schneider wrote:
> > We have a network printer that will scan docs and send them as pdf
> > docs to an e-mail address in the company. Is there any software with
> > OpenSUSE 10.3 that can do OCR from a PDF doc? I am guessing that the
> > doc contains tiff images of the scanned documents.
> >
> > Any and all pointers are welcome.
>
> Have you tried pdftotext?

It doesn't do OCR. What it does is extract the text that is already
stored in the PDF as text. If the page comes as an image, as from a
scanner, no way!

--
Cheers,
       Carlos E. R.
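Because pdftotext only extracts text that is already stored in the PDF, it can double as a quick test for whether a file needs OCR at all. The sketch below assumes that a scanned, image-only PDF yields nothing but whitespace and page breaks from pdftotext; the function name pdf_has_text is made up for illustration.

```shell
# Sketch: succeed (exit 0) if the PDF contains extractable text.
# pdf_has_text is a hypothetical name. A text-file argument of "-"
# makes pdftotext write to standard output.
pdf_has_text() {
    # Any character that is not whitespace (form feeds between pages
    # count as whitespace) means the PDF carries real text.
    pdftotext -q "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

A mail-handling script could then run pdftotext directly on files where pdf_has_text succeeds and send only the rest through an OCR step. A scanned PDF that already carries a hidden OCR text layer would count as having text here.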
Re: [opensuse] PDF OCR
On Wednesday 2007-12-12 at 19:10 +0100, Roger Oberholtzer wrote:
> We have a network printer that will scan docs and send them as pdf
> docs to an e-mail address in the company. Is there any software with
> OpenSUSE 10.3 that can do OCR from a PDF doc? I am guessing that the
> doc contains tiff images of the scanned documents.
>
> Any and all pointers are welcome.

I haven't seen any open source OCR that really works. You have to buy
it.

I'd love to be proved wrong, of course.

--
Cheers,
       Carlos E. R.
Re: [opensuse] PDF OCR
On Dec 12, 07 13:52:28 -0500, Ken Schneider wrote:
> Roger Oberholtzer pecked at the keyboard and wrote:
> > Hello
> >
> > We have a network printer that will scan docs and send them as pdf
> > docs to an e-mail address in the company. Is there any software with
> > OpenSUSE 10.3 that can do OCR from a PDF doc? I am guessing that the
> > doc contains tiff images of the scanned documents.
> >
> > Any and all pointers are welcome.
>
> Have you tried pdftotext?

pdftotext won't help with scanned documents. You could check whether
ocropus / tesseract is already up to speed ...

cheers, Jw.

--
Juergen Weigert
Re: [opensuse] PDF OCR
On Wednesday 12 December 2007 10:52, Ken Schneider wrote:
> Roger Oberholtzer pecked at the keyboard and wrote:
> > Hello
> >
> > We have a network printer that will scan docs and send them as pdf
> > docs to an e-mail address in the company. Is there any software with
> > OpenSUSE 10.3 that can do OCR from a PDF doc? I am guessing that the
> > doc contains tiff images of the scanned documents.
> >
> > Any and all pointers are welcome.
>
> Have you tried pdftotext?

I will happily recommend Tesseract.

http://code.google.com/p/tesseract-ocr/

Here's a how-to on how to do PDF to text, though I haven't been able to
convert PDF to TIFF yet...

http://www.groklaw.net/articlebasic.php?story=20061210115516438

And a few more articles...

http://www.linuxjournal.com/article/9676
http://www.howtoforge.com/ocr_with_tesseract_on_ubuntu704
Re: [opensuse] PDF OCR
* Kai Ponte [12-12-07 16:48]:
> Here's a how-to on how to do PDF to text, though I've yet to be able
> to convert PDF to TIFF yet...
>
> http://www.groklaw.net/articlebasic.php?story=20061210115516438

You can open a pdf file with gimp and save it as tiff/jpg/png/.

--
Patrick Shanahan         Plainfield, Indiana, USA        HOG # US1244711
http://wahoo.no-ip.org   Photo Album: http://wahoo.no-ip.org/gallery2
Registered Linux User #207535 @ http://counter.li.org
Re: [opensuse] PDF OCR
Carlos E. R. pecked at the keyboard and wrote:
> On Wednesday 2007-12-12 at 19:10 +0100, Roger Oberholtzer wrote:
> > We have a network printer that will scan docs and send them as pdf
> > docs to an e-mail address in the company. Is there any software with
> > OpenSUSE 10.3 that can do OCR from a PDF doc? I am guessing that the
> > doc contains tiff images of the scanned documents.
> >
> > Any and all pointers are welcome.
>
> I haven't seen any open source OCR that really works. You have to buy
> it.
>
> I'd love to be proved wrong, of course.

I have used SimpleOCR (shareware) under wine and it works quite well.

--
Ken Schneider
Re: [opensuse] PDF OCR
> I will happily recommend Tesseract.
>
> http://code.google.com/p/tesseract-ocr/
>
> Here's a how-to on how to do PDF to text, though I've yet to be able
> to convert PDF to TIFF yet...

I wrote a bash script to do that once. It descends into subdirectories
and builds a duplicate directory structure of tiffs.

It used pdftoppm and then ppm2tiff. It seemed to work pretty well for me
when I was testing, but I never really used it for production work.

If you're interested in the script (and you have some bash scripting
skills so you can read it), let me know in a private e-mail and I'll
send you a copy.

Greg
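Greg's script is not shown in the thread, so the following is only a guess at its general shape: walk a source tree, and for every PDF create a matching directory of TIFFs under a destination tree using pdftoppm and then ppm2tiff. The function name mirror_pdfs_to_tiffs and the one-directory-per-PDF layout are assumptions.

```shell
# Sketch: mirror a tree of PDFs as a tree of per-page TIFFs.
# mirror_pdfs_to_tiffs is a hypothetical name; this is not Greg's script.
mirror_pdfs_to_tiffs() {
    local src="$1" dst="$2" rel outdir ppm

    # List PDFs relative to the source root (case-sensitive match;
    # file names containing newlines are not handled).
    ( cd "$src" && find . -type f -name '*.pdf' ) |
    while IFS= read -r rel; do
        rel="${rel#./}"
        # a/b/x.pdf becomes <dst>/a/b/x/page-<N>.tif
        outdir="$dst/${rel%.pdf}"
        mkdir -p "$outdir" || continue

        # Keep the tools from reading the file list on stdin.
        pdftoppm "$src/$rel" "$outdir/page" </dev/null || continue

        for ppm in "$outdir"/page-*.ppm; do
            [ -e "$ppm" ] || continue
            ppm2tiff "$ppm" "${ppm%.ppm}.tif" </dev/null && rm -f "$ppm"
        done
    done
}
```

Running "mirror_pdfs_to_tiffs ./scans ./tiffs" would leave ./scans untouched and fill ./tiffs with the same subdirectory layout, one directory per PDF.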
Re: [opensuse] PDF OCR
On Wednesday 12 December 2007 13:55, Patrick Shanahan wrote:
> * Kai Ponte [12-12-07 16:48]:
> > Here's a how-to on how to do PDF to text, though I've yet to be able
> > to convert PDF to TIFF yet...
> >
> > http://www.groklaw.net/articlebasic.php?story=20061210115516438
>
> You can open a pdf file with gimp and save it as tiff/jpg/png/.

<Homer Simpson>
Doh!
</Homer Simpson>

I should've remembered that.

Thanks!
Re: [opensuse] PDF OCR
On Wednesday 2007-12-12 at 16:56 -0500, Ken Schneider wrote:
> Carlos E. R. pecked at the keyboard and wrote:
> > I haven't seen any open source OCR that really works. You have to
> > buy it.
> >
> > I'd love to be proved wrong, of course.
>
> I have used SimpleOCR (shareware) under wine and it works quite well.

And the one that I got with my scanner (epson p 1650) is pretty good,
but:

- it only works in windows
- it needs to scan the page itself; it will not even look at a file.
  I.e., it is crippled on purpose so that you cannot use it with
  another scanner.

--
Cheers,
       Carlos E. R.
Re: [opensuse] PDF OCR
--- Roger Oberholtzer wrote:
> Hello
>
> We have a network printer that will scan docs and send them as pdf
> docs to an e-mail address in the company. Is there any software with
> OpenSUSE 10.3 that can do OCR from a PDF doc? I am guessing that the
> doc contains tiff images of the scanned documents.
>
> Any and all pointers are welcome.
>
> --
> Roger Oberholtzer

Have you tried PDFedit?

Stephen W
Sarasota, FL USA

Ignorance more frequently begets confidence than does knowledge.
-Charles Darwin, naturalist and author (1809-1882)
Re: [opensuse] PDF OCR
On Wed, 2007-12-12 at 13:46 -0800, Kai Ponte wrote:
> Here's a how-to on how to do PDF to text, though I've yet to be able
> to convert PDF to TIFF yet...

From ImageMagick:

    convert x.pdf x.tiff

Or to any format ImageMagick supports, not just tiff. I am not sure how
it (or gimp, as was also suggested) works with multiple pages.

--
Roger Oberholtzer