Re: [opensuse] PDF OCR

2007-12-13 Thread Ciaran Farrell
On Thursday 13 December 2007, StephenW wrote:
 --- Roger Oberholtzer [EMAIL PROTECTED] wrote:
  Hello
 
  We have a network printer that will scan docs and send them as pdf docs
  to an e-mail address in the company. Is there any software with OpenSUSE
  10.3 that can do OCR from a PDF doc? I am guessing that the doc contains
  tiff images of the scanned documents. Any and all pointers are welcome.

I had to do much the same in the past - a quick bash script seemed like the 
best way to solve it:

1. use pdftoppm to extract the images from the pdf to a new directory
2. use ppm2tiff on all the extracted ppm files
3. use tesseract or whatever it's called these days on the tiff files
4. append the text files to a single text file (or leave them separate, 
whatever)

There's probably a much more sensible way of doing this :-) but this worked 
consistently for me for quite a number of documents scanned and sent as pdf.
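
The four steps above could be sketched roughly as follows. This is not the original script: the function name ocr_pdf, the 300 dpi resolution and the scratch-directory layout are assumptions made for illustration, and it presumes pdftoppm, ppm2tiff and tesseract are installed. The "-c none" option is there because some older Tesseract releases reportedly read only uncompressed TIFF.

```shell
# Sketch only: ocr_pdf is a made-up name; pdftoppm, ppm2tiff and
# tesseract must already be installed.
ocr_pdf() {
    pdf=$1                         # scanned PDF to read
    out=$2                         # combined text file to write
    work=$(mktemp -d) || return 1  # scratch directory for page images

    # 1. render every page to $work/page-<number>.ppm at 300 dpi
    pdftoppm -r 300 "$pdf" "$work/page" || return 1

    : > "$out"
    for ppm in "$work"/page-*.ppm; do
        [ -e "$ppm" ] || continue
        base=${ppm%.ppm}
        # 2. convert the PPM to an uncompressed TIFF
        ppm2tiff -c none "$ppm" "$base.tif" || return 1
        # 3. run OCR; tesseract writes its result to $base.txt
        tesseract "$base.tif" "$base" || return 1
        # 4. append this page's text to the combined file
        cat "$base.txt" >> "$out"
    done
    rm -rf "$work"
}
```

Usage would be something like: ocr_pdf scan.pdf scan.txt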

Ciaran



-- 
SUSE LINUX Products GmbH
GF: Markus Rex
HRB 16746 (AG Nuremberg)
Maxfeldstrasse 5
90409, Nuremberg
Tel: +49 911 74053 262




Re: [opensuse] PDF OCR

2007-12-13 Thread Greg Freemyer
On Dec 13, 2007 2:18 AM, Roger Oberholtzer [EMAIL PROTECTED] wrote:
 On Wed, 2007-12-12 at 13:46 -0800, Kai Ponte wrote:

  Here's a how-to on how to do PDF to text, though I've yet to be able to
  convert PDF to TIFF yet...

 From ImageMagick:  convert x.pdf x.tiff

 Or to any format ImageMagick supports, not just tiff. I am not sure how
 it (or gimp, as was also suggested) works with multiple pages.


I just searched the archives and found that I initiated a similar thread back in April.

Apparently I tried convert from ImageMagick and found it lacking.
My problem was I wanted greyscale tiffs and I was only able to get
B&W.

I used the pdftoppm / ppm2tiff approach.
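
For greyscale output specifically, pdftoppm has a -gray option that writes PGM files, which ppm2tiff also accepts. A minimal sketch, assuming both tools are installed; the function name pdf_to_grey_tiffs and the 300 dpi value are illustrative and not taken from the script mentioned here:

```shell
# Sketch only: pdf_to_grey_tiffs is a made-up name.
pdf_to_grey_tiffs() {
    pdf=$1    # input PDF
    root=$2   # output prefix, e.g. out/page gives out/page-<number>.tif

    # -gray makes pdftoppm write greyscale PGM files instead of colour PPM
    pdftoppm -gray -r 300 "$pdf" "$root" || return 1

    for pgm in "$root"-*.pgm; do
        [ -e "$pgm" ] || continue
        # ppm2tiff reads PGM as well as PPM input
        ppm2tiff "$pgm" "${pgm%.pgm}.tif" || return 1
    done
}
```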

Greg
-- 
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [opensuse] PDF OCR

2007-12-13 Thread Roger Oberholtzer
On Thu, 2007-12-13 at 08:32 -0500, Greg Freemyer wrote:

 Apparently I tried convert from ImageMagick and found it lacking.
 My problem was I wanted greyscale tiffs and I was only able to get
 B&W.

I can get color and all. But the quality is terrible compared to what I
see in acroread. Same with gimp. pdftoppm does indeed seem to preserve
what is in the pdf. I wonder why gimp and convert are so bad by
comparison. In fact, they seem to be equally bad. They must share some tool
that is the root cause. But I can be happy with pdftoppm.
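
One possible explanation, offered as an assumption rather than something confirmed in this thread: ImageMagick hands PDF rendering to Ghostscript and rasterizes at 72 dpi unless told otherwise. Passing -density before the input file raises the resolution. A minimal sketch; the function name pdf_to_tiff_hires and the 300 dpi default are illustrative:

```shell
# Sketch only: pdf_to_tiff_hires is a made-up name.
# Usage: pdf_to_tiff_hires in.pdf out.tiff [dpi]
pdf_to_tiff_hires() {
    # -density must come before the input file so that it applies
    # while the PDF is being rasterized
    convert -density "${3:-300}" "$1" "$2"
}
```

On ImageMagick 7 the command is normally spelled magick rather than convert.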

 I used the pdftoppm / ppm2tiff approach.

-- 
Roger Oberholtzer

OPQ Systems / Ramböll RST

Ramböll Sverige AB
Kapellgränd 7
P.O. Box 4205
SE-102 65 Stockholm, Sweden

Office: Int +46 8-615 60 20
Mobile: Int +46 70-815 1696




Re: [opensuse] PDF OCR

2007-12-13 Thread Juergen Weigert
On Dec 13, 07 09:19:11 +0100, Ciaran Farrell wrote:
 On Thursday 13 December 2007, StephenW wrote:
  --- Roger Oberholtzer [EMAIL PROTECTED] wrote:
   Hello
  
   We have a network printer that will scan docs and send them as pdf docs
   to an e-mail address in the company. Is there any software with OpenSUSE
   10.3 that can do OCR from a PDF doc? I am guessing that the doc contains
   tiff images of the scanned documents. Any and all pointers are welcome.
 
 I had to do much the same in the past - a quick bash script seemed like the 
 best way to solve it:
 
 1. use pdftoppm to extract the images from the pdf to a new directory
 2. use ppm2tiff on all the extracted ppm files
 3. use tesseract or whatever it's called these days on the tiff files
 4. append the text files to a single text file (or leave them separate, 
 whatever)
 
 There's probably a much more sensible way of doing this :-) but this worked 
 consistently for me for quite a number of documents scanned and sent as pdf.

This is already the best approach, afaik.
I assume ocropus helps with layout issues like multicolumn and such.

Any volunteers who want to try out ocropus?
I see rpm packages in
http://download.opensuse.org/repositories/home:/StefanBruens

cheers,
Jw.

-- 
 o \  Juergen Weigert  paint it green! __/ _===.===_
V | [EMAIL PROTECTED]   wide open suse_/_---|\/
 \  | 0911 74053-508 (tm)__/  (//\
(/) | __/ _/ \_ vim:set sw=2 wm=8
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nuernberg)
Novell is committed to creating a work environment that embraces clarity.




Re: [opensuse] PDF OCR

2007-12-12 Thread Ken Schneider
Roger Oberholtzer pecked at the keyboard and wrote:
 Hello
 
 We have a network printer that will scan docs and send them as pdf docs
 to an e-mail address in the company. Is there any software with OpenSUSE
 10.3 that can do OCR from a PDF doc? I am guessing that the doc contains
 tiff images of the scanned documents. Any and all pointers are welcome.
 

Have you tried pdftotext ?

pc5:~ # pdftotext -h
pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>      : first page to convert
  -l <int>      : last page to convert
  -layout       : maintain original physical layout
  -raw          : keep strings in content stream order
  -htmlmeta     : generate a simple HTML file, including the meta information
  -enc <string> : output text encoding name
  -eol <string> : output end-of-line convention (unix, dos, or mac)
  -nopgbrk      : don't insert page breaks between pages
  -opw <string> : owner password (for encrypted files)
  -upw <string> : user password (for encrypted files)
  -q            : don't print any messages or errors
  -cfg <string> : configuration file to use in place of .xpdfrc
  -v            : print copyright and version info
  -h            : print usage information
  -help         : print usage information
  --help        : print usage information
  -?            : print usage information

-- 
Ken Schneider
SuSe since Version 5.2, June 1998



Re: [opensuse] PDF OCR

2007-12-12 Thread Carlos E. R.




On Wednesday 2007-12-12 at 13:52 -0500, Ken Schneider wrote:


We have a network printer that will scan docs and send them as pdf docs
to an e-mail address in the company. Is there any software with OpenSUSE
10.3 that can do OCR from a PDF doc? I am guessing that the doc contains
tiff images of the scanned documents. Any and all pointers are welcome.



Have you tried pdftotext ?


It doesn't do OCR. What it does is extract the text of the PDF that comes 
already as text. If it comes as an image, like from a scanner, no way!
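
A quick way to tell the two cases apart is to run pdftotext and see whether anything other than whitespace comes out. A minimal sketch, assuming pdftotext is installed; the function name has_text_layer is made up for illustration:

```shell
# Sketch only: has_text_layer is a made-up name.
# Succeeds if the PDF contains extractable text, fails if pdftotext
# finds nothing (typical for a pure scan).
has_text_layer() {
    # "-" as the output file sends the text to stdout; tr strips all
    # whitespace, including the form feeds used as page breaks
    text=$(pdftotext -q "$1" - 2>/dev/null | tr -d '[:space:]')
    [ -n "$text" ]
}
```

For example: has_text_layer scan.pdf || echo "needs OCR"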


-- 
Cheers,

   Carlos E. R.



Re: [opensuse] PDF OCR

2007-12-12 Thread Carlos E. R.




On Wednesday 2007-12-12 at 19:10 +0100, Roger Oberholtzer wrote:


We have a network printer that will scan docs and send them as pdf docs
to an e-mail address in the company. Is there any software with OpenSUSE
10.3 that can do OCR from a PDF doc? I am guessing that the doc contains
tiff images of the scanned documents. Any and all pointers are welcome.


I haven't seen any open source OCR that really works. You have to buy it. 
I'd love to be proved wrong, of course.


-- 
Cheers,

   Carlos E. R.



Re: [opensuse] PDF OCR

2007-12-12 Thread Juergen Weigert
On Dec 12, 07 13:52:28 -0500, Ken Schneider wrote:
 Roger Oberholtzer pecked at the keyboard and wrote:
  Hello
  
  We have a network printer that will scan docs and send them as pdf docs
  to an e-mail address in the company. Is there any software with OpenSUSE
  10.3 that can do OCR from a PDF doc? I am guessing that the doc contains
  tiff images of the scanned documents. Any and all pointers are welcome.
  
 
 Have you tried pdftotext ?

pdftotext won't help with scanned documents.
You could check whether ocropus / tesseract is already up to speed ...

cheers,
Jw.





Re: [opensuse] PDF OCR

2007-12-12 Thread Kai Ponte
On Wednesday 12 December 2007 10:52, Ken Schneider wrote:
 Roger Oberholtzer pecked at the keyboard and wrote:
  Hello
 
  We have a network printer that will scan docs and send them as pdf docs
  to an e-mail address in the company. Is there any software with OpenSUSE
  10.3 that can do OCR from a PDF doc? I am guessing that the doc contains
  tiff images of the scanned documents. Any and all pointers are welcome.

 Have you tried pdftotext ?


I will happily recommend Tesseract.  

http://code.google.com/p/tesseract-ocr/

Here's a how-to on converting PDF to text, though I haven't been able to 
convert PDF to TIFF yet...

http://www.groklaw.net/articlebasic.php?story=20061210115516438

And a few more articles...

http://www.linuxjournal.com/article/9676

http://www.howtoforge.com/ocr_with_tesseract_on_ubuntu704

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [opensuse] PDF OCR

2007-12-12 Thread Patrick Shanahan

* Kai Ponte [EMAIL PROTECTED] [12-12-07 16:48]:
 Here's a how-to on how to do PDF to text, though I've yet to be able to 
 convert PDF to TIFF yet...
 
 http://www.groklaw.net/articlebasic.php?story=20061210115516438


You can open a pdf file with gimp and save it as tiff/jpg/png/etc.

-- 
Patrick Shanahan    Plainfield, Indiana, USA    HOG # US1244711
http://wahoo.no-ip.org Photo Album:  http://wahoo.no-ip.org/gallery2
Registered Linux User #207535 @ http://counter.li.org



Re: [opensuse] PDF OCR

2007-12-12 Thread Ken Schneider
Carlos E. R. pecked at the keyboard and wrote:
 
 
 The Wednesday 2007-12-12 at 19:10 +0100, Roger Oberholtzer wrote:
 
 We have a network printer that will scan docs and send them as pdf docs
 to an e-mail address in the company. Is there any software with OpenSUSE
 10.3 that can do OCR from a PDF doc? I am guessing that the doc contains
 tiff images of the scanned documents. Any and all pointers are welcome.
 
 I haven't seen any open source OCR that really works. You have to buy
 it. I'd love to be proved wrong, of course.
 
 -- Cheers,
Carlos E. R.
 

I have used SimpleOCR (shareware) under wine and it works quite well.

-- 
Ken Schneider
SuSe since Version 5.2, June 1998



Re: [opensuse] PDF OCR

2007-12-12 Thread Greg Freemyer

 I will happily recommend Tesseract.

 http://code.google.com/p/tesseract-ocr/

 Here's a how-to on how to do PDF to text, though I've yet to be able to
 convert PDF to TIFF yet...

I wrote a bash script to do that once.  Descends into subdirectories
etc. and makes a duplicate directory structure of tiffs.

It used pdftoppm and then ppm2tiff.

Seemed to work pretty well for me when I was testing.  Never really
used it for production work.
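
A script along those lines might look roughly like this. It is not the script offered here, only a sketch under assumptions: the function name mirror_pdfs_to_tiffs is made up, pdftoppm and ppm2tiff are presumed installed, and file names containing newlines are not handled.

```shell
# Sketch only: mirror_pdfs_to_tiffs is a made-up name.
# Usage: mirror_pdfs_to_tiffs SRC_DIR DST_DIR
mirror_pdfs_to_tiffs() {
    src=${1%/}   # tree to search for PDFs
    dst=${2%/}   # tree that receives the TIFFs

    find "$src" -type f -name '*.pdf' | while IFS= read -r pdf; do
        rel=${pdf#"$src"/}                    # path relative to the source tree
        outdir=$dst/$(dirname "$rel")         # same subdirectory under $dst
        root=$outdir/$(basename "$rel" .pdf)  # prefix for this PDF's pages
        mkdir -p "$outdir" || exit 1

        # one PPM per page, named $root-<number>.ppm
        pdftoppm "$pdf" "$root" || exit 1

        for ppm in "$root"-*.ppm; do
            [ -e "$ppm" ] || continue
            # keep only the TIFF once the conversion has succeeded
            ppm2tiff "$ppm" "${ppm%.ppm}.tif" && rm -f "$ppm"
        done
    done
}
```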

If you're interested in the script (and you have some bash scripting
skills so you can read it), let me know in a private e-mail.  I'll send
you a copy.

Greg



Re: [opensuse] PDF OCR

2007-12-12 Thread Kai Ponte
On Wednesday 12 December 2007 13:55, Patrick Shanahan wrote:
 * Kai Ponte [EMAIL PROTECTED] [12-12-07 16:48]:
  Here's a how-to on how to do PDF to text, though I've yet to be able to
  convert PDF to TIFF yet...
 
  http://www.groklaw.net/articlebasic.php?story=20061210115516438

 You can open a pdf file with gimp and save it as tiff/jpg/png/etc.

<Homer Simpson>
Doh!
</Homer Simpson>

I should've remembered that.

Thanks!



Re: [opensuse] PDF OCR

2007-12-12 Thread Carlos E. R.




On Wednesday 2007-12-12 at 16:56 -0500, Ken Schneider wrote:


Carlos E. R. pecked at the keyboard and wrote:



I haven't seen any open source OCR that really works. You have to buy
it. I'd love to be proved wrong, of course.


I have used SimpleOCR (shareware) under wine and it works quite well.


And the one that I got with my scanner (epson p 1650) is pretty good, but:

 - it only works in Windows
 - it needs to scan the page itself, it will not even look at a file.

I.e., it is crippled on purpose so that you cannot use it with another 
scanner.


-- 
Cheers,

   Carlos E. R.



Re: [opensuse] PDF OCR

2007-12-12 Thread StephenW

--- Roger Oberholtzer [EMAIL PROTECTED] wrote:

 Hello
 
 We have a network printer that will scan docs and send them as pdf docs
 to an e-mail address in the company. Is there any software with OpenSUSE
 10.3 that can do OCR from a PDF doc? I am guessing that the doc contains
 tiff images of the scanned documents. Any and all pointers are welcome.
 

Have you tried PDFedit?


Stephen W
Sarasota, FL USA

Ignorance more frequently begets confidence than does knowledge. 
-Charles Darwin, naturalist and author (1809-1882)



Re: [opensuse] PDF OCR

2007-12-12 Thread Roger Oberholtzer
On Wed, 2007-12-12 at 13:46 -0800, Kai Ponte wrote:

 Here's a how-to on how to do PDF to text, though I've yet to be able to 
 convert PDF to TIFF yet...

From ImageMagick:  convert x.pdf x.tiff

Or to any format ImageMagick supports, not just tiff. I am not sure how
it (or gimp, as was also suggested) works with multiple pages.
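
On the multi-page question: as far as ImageMagick goes, a plain output name such as x.tiff gives one multi-page TIFF, while a printf-style pattern such as %03d in the output name gives one file per page. A minimal sketch; the function name pdf_pages_to_tiffs is made up for illustration, and numbering should start at 0 by default:

```shell
# Sketch only: pdf_pages_to_tiffs is a made-up name.
# Usage: pdf_pages_to_tiffs in.pdf PREFIX
pdf_pages_to_tiffs() {
    # %03d in the output name makes convert write one file per page:
    # PREFIX-000.tiff, PREFIX-001.tiff, ...
    convert "$1" "$2-%03d.tiff"
}
```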
