Re: [PHP] PDF to Text

2006-04-21 Thread Al

Jay Blanchard wrote:

[snip]

I am trying to find a way for a program to search through the text on a
PDF. My first thought was to use pdftotext, but the PDFs generated by our
commercial scanner/copier/printer machine do not seem to work with
pdftotext... it just outputs two CRLFs.  I've been looking around on the
net for something similar that might work.

Anyone know of something like that?

Thanks,
--
Ray Hauge


Things I forgot to post:

It is a PHP script.  I was planning on using shell_exec() to call the
program and read the output from stdout.

[/snip]

Sounds like the PDFs are images and therefore will not be readable by
anything, save for eyeballs. I have run into this quite a bit. The
scanner scans the doc via a TWAIN driver, which then converts the info
into an image of that which was scanned. It would be like trying to read
text programmatically from a JPEG... not really possible.



http://www.cs.wisc.edu/~ghost/  will do it.
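
For anyone following along: that link is the Ghostscript home page.
Ghostscript does not do OCR itself, but it can render each PDF page to an
image that an OCR tool can then read. A rough sketch of building such a
command from PHP follows. It assumes the "gs" binary is on the PATH; the
function name, the pgmraw device choice, and the 300 dpi default are
illustrative assumptions, not something stated in this thread.

```php
<?php
// Sketch only: build a Ghostscript command that renders every page of a PDF
// to a grayscale PGM image. buildRasterizeCommand() is a hypothetical helper.

function buildRasterizeCommand(string $pdfPath, string $outPattern, int $dpi = 300): string
{
    // -dNOPAUSE -dBATCH: run non-interactively and exit when done.
    // -sDEVICE=pgmraw:   write binary grayscale PGM files.
    // -r<dpi>:           rendering resolution; higher usually helps OCR.
    return 'gs -q -dNOPAUSE -dBATCH -sDEVICE=pgmraw'
        . ' -r' . $dpi
        . ' -sOutputFile=' . escapeshellarg($outPattern)
        . ' ' . escapeshellarg($pdfPath);
}

// Usage (not executed here): Ghostscript expands %03d to the page number.
// shell_exec(buildRasterizeCommand('scan.pdf', 'page-%03d.pgm'));
```

The command is only built, not run, so it can be logged and checked by hand
before it is passed to shell_exec().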

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] PDF to Text

2006-04-21 Thread Ray Hauge
On Thursday 20 April 2006 19:23, Richard Lynch wrote:
> Actually, it's "possible" just bloody difficult.
>
> You're looking into a topic known as OCR (Optical Character Recognition).
>
> One open-source project for this is:
> GOCR (aka JOCR)
> It's GOCR on freshmeat and JOCR on sourceforge because the name they
> wanted was "taken" by another project. :-(
>
> A commercial product known as OmniPage is probably the "best"
> solution, unfortunately.
>

Thanks for the info.  It makes sense that the scanner makes the image and
puts that in the PDF.  I'll have to look into GOCR, or just scrap the idea I
had.  Luckily I'm still just in the planning stage and we haven't figured out
how all the processes are going to work :)

Thanks again,

-- 
Ray Hauge
Programmer/Systems Administrator
American Student Loan Services
www.americanstudentloan.com
1.800.575.1099




RE: [PHP] PDF to Text

2006-04-20 Thread Richard Lynch
On Thu, April 20, 2006 8:59 pm, Jay Blanchard wrote:
> [snip]
>> I am trying to find a way for a program to search through the text
>> on a PDF. My first thought was to use pdftotext, but the PDFs
>> generated by our commercial scanner/copier/printer machine do not
>> seem to work with pdftotext... it just outputs two CRLFs.  I've been
>> looking around on the net for something similar that might work.
>>
>> Anyone know of something like that?
>>
>> Thanks,
>> --
>> Ray Hauge
>
> Things I forgot to post:
>
> It is a PHP script.  I was planning on using shell_exec() to call the
> program
> and read the output from stdout.
> [/snip]
>
> Sounds like the PDFs are images and therefore will not be readable by
> anything, save for eyeballs. I have run into this quite a bit. The
> scanner scans the doc via a TWAIN driver, which then converts the info
> into an image of that which was scanned. It would be like trying to
> read text programmatically from a JPEG... not really possible.

Actually, it's "possible" just bloody difficult.

You're looking into a topic known as OCR (Optical Character Recognition).

One open-source project for this is:
GOCR (aka JOCR)
It's GOCR on freshmeat and JOCR on sourceforge because the name they
wanted was "taken" by another project. :-(

A commercial product known as OmniPage is probably the "best"
solution, unfortunately.

Some interesting options.

I've been thinking of maybe writing a 'real' extension to PHP,
and GOCR/JOCR is one of the candidates I'd consider...

You also could, theoretically, convert the PDF to an image of some
kind,  pull it into GD, and then roll your own package based around:
http://php.net/imagecolorat
-- along with a zillion lines of code to reduce noise, detect edges,
and compute "distance" between two glyphs...

I did something like this on a very very very small and limited scale
recently, but it's not code I can publish nor is it truly useful to
you anyway.

Your best bet at this point is to search for "PDF OCR" and/or "PDF to
image" and then "OCR" separately and hope to find two packages
together that will suit your needs.
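
As a rough illustration of the second half of such a two-package setup, the
sketch below runs an OCR tool on a page image that has already been rendered
and then searches the result. It assumes the gocr binary is on the PATH and
accepts "-i <pnm-file>", writing text to stdout; check that against the
version you install. The function names are hypothetical.

```php
<?php
// Sketch only: OCR one already-rendered page image and search its text.
// buildOcrCommand() and textContains() are hypothetical helper names.

function buildOcrCommand(string $imagePath): string
{
    // Assumption: gocr reads the image named after -i and prints the text.
    return 'gocr -i ' . escapeshellarg($imagePath);
}

function textContains(string $ocrText, string $needle): bool
{
    // OCR output tends to have irregular spacing and case, so collapse
    // whitespace and lowercase both sides before comparing.
    $normalize = function (string $s): string {
        return strtolower(preg_replace('/\s+/', ' ', trim($s)));
    };
    return strpos($normalize($ocrText), $normalize($needle)) !== false;
}

// Usage (not executed here):
// $text = (string) shell_exec(buildOcrCommand('page-001.pgm'));
// $found = textContains($text, 'loan number');
```

Normalizing before the search does not fix misread characters; it only
avoids misses caused by line breaks and stray spaces in the OCR output.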

Note that OCR is, at best, only going to correctly convert ~95% of the
PDF into text.

If you need error-free conversion, forget software automation and do
it by hand, or count on a human intervention step in the process to
correct the transcription, because you will NOT get 100%.

Even ~9x% assumes good clean images and a lot of factors in the
image-quality can lower that drastically fast.
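
To put the ~95% figure in perspective, here is a back-of-the-envelope
sketch. It treats the figure as a per-character accuracy and assumes errors
are independent of each other, which real OCR errors are not, so read the
numbers as a rough guide only. The function names are made up for this
example.

```php
<?php
// Sketch only: what a per-character accuracy implies for pages and words.
// Assumes independent errors; both function names are hypothetical.

function expectedCharErrors(int $charCount, float $accuracy): float
{
    // Expected number of misread characters on a page of $charCount chars.
    return $charCount * (1.0 - $accuracy);
}

function wordAccuracy(float $charAccuracy, int $wordLength): float
{
    // Chance that every character of a word is read correctly.
    return pow($charAccuracy, $wordLength);
}

// At 95%, a 2000-character page has about 100 wrong characters, and a
// 5-letter word comes out fully correct only about 77% of the time.
```

This matters for searching: an exact-match search for a word only succeeds
when every character of that word was recognized correctly.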

-- 
Like Music?
http://l-i-e.com/artists.htm




RE: [PHP] PDF to Text

2006-04-20 Thread Jay Blanchard
[snip]
> I am trying to find a way for a program to search through the text on
> a PDF. My first thought was to use pdftotext, but the PDFs generated by
> our commercial scanner/copier/printer machine do not seem to work with
> pdftotext... it just outputs two CRLFs.  I've been looking around on
> the net for something similar that might work.
>
> Anyone know of something like that?
>
> Thanks,
> --
> Ray Hauge

Things I forgot to post:

It is a PHP script.  I was planning on using shell_exec() to call the
program and read the output from stdout.
[/snip]

Sounds like the PDFs are images and therefore will not be readable by
anything, save for eyeballs. I have run into this quite a bit. The
scanner scans the doc via a TWAIN driver, which then converts the info
into an image of that which was scanned. It would be like trying to read
text programmatically from a JPEG... not really possible.




Re: [PHP] PDF to Text

2006-04-20 Thread Ray Hauge
On Thursday 20 April 2006 18:06, Ray Hauge wrote:
> Hello List,
>
> I am trying to find a way for a program to search through the text on a
> PDF. My first thought was to use pdftotext, but the PDFs generated by our
> commercial scanner/copier/printer machine do not seem to work with
> pdftotext... it just outputs two CRLFs.  I've been looking around on the
> net for something similar that might work.
>
> Anyone know of something like that?
>
> Thanks,
> --
> Ray Hauge

Things I forgot to post:

It is a PHP script.  I was planning on using shell_exec() to call the program 
and read the output from stdout.
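
A minimal sketch of that plan, for reference. It assumes the pdftotext
binary is on the PATH and relies on pdftotext treating "-" as "write to
stdout". The helper names are hypothetical. Output that is nothing but
whitespace (such as the two CRLFs described above) is reported as "no text
found" rather than as an empty match.

```php
<?php
// Sketch only: call pdftotext through shell_exec() and detect PDFs that
// yield no usable text. All three function names are hypothetical.

function buildPdftotextCommand(string $pdfPath): string
{
    // "-" as the output file name sends the extracted text to stdout.
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' -';
}

function looksLikeNoText($output): bool
{
    // shell_exec() can return null or false instead of a string, and an
    // image-only PDF tends to produce only line breaks or form feeds.
    return !is_string($output) || trim($output, " \t\n\r\0\x0B\f") === '';
}

function extractPdfText(string $pdfPath): ?string
{
    $output = shell_exec(buildPdftotextCommand($pdfPath));
    return looksLikeNoText($output) ? null : $output;
}

// Usage (not executed here):
// $text = extractPdfText('scan.pdf');
// if ($text === null) { /* no text layer found, or the command failed */ }
```

A null result does not say why there was no text, so the caller cannot tell
a missing binary from a PDF without a text layer without further checks.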

-- 
Ray Hauge
Programmer/Systems Administrator
American Student Loan Services
www.americanstudentloan.com
1.800.575.1099




[PHP] PDF to Text

2006-04-20 Thread Ray Hauge
Hello List,

I am trying to find a way for a program to search through the text on a PDF.  
My first thought was to use pdftotext, but the PDFs generated by our 
commercial scanner/copier/printer machine do not seem to work with 
pdftotext... it just outputs two CRLFs.  I've been looking around on the net 
for something similar that might work.

Anyone know of something like that?

Thanks,
-- 
Ray Hauge
Programmer/Systems Administrator
American Student Loan Services
www.americanstudentloan.com
1.800.575.1099
