Re: [PHP] PDF to Text
Jay Blanchard wrote:
> [snip]
> I am trying to find a way for a program to search through the text on a
> PDF. My first thought was to use pdftotext, but the PDFs generated by our
> commercial scanner/copier/printer machine do not seem to work with
> pdftotext... it just outputs two CRLFs. I've been looking around on the
> net for something similar that might work.
>
> Anyone know of something like that?
>
> Thanks,
> --
> Ray Hauge
>
> Things I forgot to post:
>
> It is a PHP script. I was planning on using shell_exec() to call the
> program and read the output from stdout.
> [/snip]
>
> Sounds like the PDFs are images and therefore will not be readable by
> anything, save for eyeballs. I have run into this quite a bit. The
> scanner scans the doc via a TWAIN driver, which then converts the info
> into an image of that which was scanned. It would be like trying to read
> text programmatically from a JPEG... not really possible.

http://www.cs.wisc.edu/~ghost/ will do it.

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] PDF to Text
On Thursday 20 April 2006 19:23, Richard Lynch wrote:
> Actually, it's "possible" just bloody difficult.
>
> You're looking into a topic known as OCR (Optical Character Recognition).
>
> One open source project for this is:
> GOCR (aka JOCR)
> It's GOCR on freshmeat and JOCR on sourceforge because the name they
> wanted was "taken" by another project. :-(
>
> A commercial product known as OmniPage is probably the "best"
> solution, unfortunately.

Thanks for the info. It makes sense that the scanner makes the image and puts that in the PDF. I'll have to look into GOCR, or just scrap the idea I had. Luckily I'm still just in the planning stage and we haven't figured out how all the processes are going to work :)

Thanks again,
--
Ray Hauge
Programmer/Systems Administrator
American Student Loan Services
www.americanstudentloan.com
1.800.575.1099
RE: [PHP] PDF to Text
On Thu, April 20, 2006 8:59 pm, Jay Blanchard wrote:
> [snip]
>> I am trying to find a way for a program to search through the text on a
>> PDF. My first thought was to use pdftotext, but the PDFs generated by our
>> commercial scanner/copier/printer machine do not seem to work with
>> pdftotext... it just outputs two CRLFs. I've been looking around on the
>> net for something similar that might work.
>>
>> Anyone know of something like that?
>>
>> Thanks,
>> --
>> Ray Hauge
>
> Things I forgot to post:
>
> It is a PHP script. I was planning on using shell_exec() to call the
> program and read the output from stdout.
> [/snip]
>
> Sounds like the PDFs are images and therefore will not be readable by
> anything, save for eyeballs. I have run into this quite a bit. The
> scanner scans the doc via a TWAIN driver, which then converts the info
> into an image of that which was scanned. It would be like trying to read
> text programmatically from a JPEG... not really possible.

Actually, it's "possible" just bloody difficult.

You're looking into a topic known as OCR (Optical Character Recognition).

One open source project for this is:
GOCR (aka JOCR)
It's GOCR on freshmeat and JOCR on sourceforge because the name they wanted was "taken" by another project. :-(

A commercial product known as OmniPage is probably the "best" solution, unfortunately.

Some interesting options. I've been thinking of maybe writing a 'real' extension to PHP, and GOCR/JOCR is one of the candidates I'd consider...

You also could, theoretically, convert the PDF to an image of some kind, pull it into GD, and then roll your own package based around:
http://php.net/imagecolorat
-- along with a zillion lines of code to reduce noise, detect edges, and compute "distance" between two glyphs...

I did something like this on a very very very small and limited scale recently, but it's not code I can publish nor is it truly useful to you anyway.
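As a rough illustration of the imagecolorat() idea, here is a minimal sketch of the very first step only: turning a page image that GD has already loaded into a grid of ink/background values. It assumes the GD extension is available; the function name binarize() and the threshold of 128 are made up for the example, and none of the noise reduction, edge detection, or glyph matching is attempted.

```php
<?php
// Sketch only: requires the GD extension. binarize() is a hypothetical helper,
// not part of GD or of any OCR package.

/**
 * Convert a GD image into rows of 1 (dark "ink" pixel) and 0 (background).
 *
 * @param resource|GdImage $img       image already loaded into GD
 * @param int              $threshold gray level (0-255) below which a pixel counts as ink
 */
function binarize($img, int $threshold = 128): array
{
    $width  = imagesx($img);
    $height = imagesy($img);
    $rows   = [];

    for ($y = 0; $y < $height; $y++) {
        $row = [];
        for ($x = 0; $x < $width; $x++) {
            // imagecolorat() returns a color value (or palette index);
            // imagecolorsforindex() turns either into red/green/blue parts.
            $c    = imagecolorsforindex($img, imagecolorat($img, $x, $y));
            $gray = (int) round(($c['red'] + $c['green'] + $c['blue']) / 3);
            $row[] = $gray < $threshold ? 1 : 0;
        }
        $rows[] = $row;
    }

    return $rows;
}
```

The resulting grid is only the raw material; recognising characters in it is where the "zillion lines of code" come in.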
Your best bet at this point is to search for "PDF OCR" and/or "PDF to image" and then "OCR" separately and hope to find two packages together that will suit your needs.

Note that OCR is, at best, only going to correctly convert ~95% of the PDF into text. If you need error-free conversion, forget software automation and do it by hand, or count on a human intervention step in the process to correct the transcription, because you will NOT get 100%.

Even ~9x% assumes good clean images, and a lot of factors in the image quality can lower that drastically fast.

--
Like Music?
http://l-i-e.com/artists.htm
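One possible pairing of "PDF to image" and "OCR", sketched in PHP with shell_exec(): Ghostscript renders each page to a grayscale PGM file and GOCR reads each file. This assumes both gs and gocr are installed and on the PATH; the function names, the 300 dpi default, and the page-%03d.pgm naming are choices made for the example, and the output will still need a human check for the reasons given above.

```php
<?php
// Sketch only: assumes the Ghostscript (gs) and GOCR (gocr) binaries are on the PATH.
// All function names here are hypothetical helpers for illustration.

/** Build the Ghostscript command that renders every page to a PGM image. */
function buildRasterizeCommand(string $pdfPath, string $outPattern, int $dpi = 300): string
{
    return sprintf(
        'gs -q -dNOPAUSE -dBATCH -sDEVICE=pgmraw -r%d -sOutputFile=%s %s',
        $dpi,
        escapeshellarg($outPattern), // e.g. /tmp/work/page-%03d.pgm, one file per page
        escapeshellarg($pdfPath)
    );
}

/** Build the GOCR command for one page image; recognised text goes to stdout. */
function buildOcrCommand(string $pgmPath): string
{
    return 'gocr -i ' . escapeshellarg($pgmPath);
}

/** Render the PDF into $workDir, OCR each page in order, and return the joined text. */
function ocrPdf(string $pdfPath, string $workDir): string
{
    $dir = rtrim($workDir, '/');
    shell_exec(buildRasterizeCommand($pdfPath, $dir . '/page-%03d.pgm'));

    $pages = glob($dir . '/page-*.pgm') ?: [];
    sort($pages); // zero-padded names keep pages in order

    $text = '';
    foreach ($pages as $page) {
        // shell_exec() yields null or false on failure; the cast turns that into ''
        $text .= (string) shell_exec(buildOcrCommand($page));
    }

    return $text;
}
```

Calling ocrPdf('/path/to/scan.pdf', '/tmp/work') would then return whatever text GOCR managed to recognise, which can be searched with stripos() like any other string.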
RE: [PHP] PDF to Text
[snip]
> I am trying to find a way for a program to search through the text on a
> PDF. My first thought was to use pdftotext, but the PDFs generated by our
> commercial scanner/copier/printer machine do not seem to work with
> pdftotext... it just outputs two CRLFs. I've been looking around on the
> net for something similar that might work.
>
> Anyone know of something like that?
>
> Thanks,
> --
> Ray Hauge

Things I forgot to post:

It is a PHP script. I was planning on using shell_exec() to call the program and read the output from stdout.
[/snip]

Sounds like the PDFs are images and therefore will not be readable by anything, save for eyeballs. I have run into this quite a bit. The scanner scans the doc via a TWAIN driver, which then converts the info into an image of that which was scanned. It would be like trying to read text programmatically from a JPEG... not really possible.
Re: [PHP] PDF to Text
On Thursday 20 April 2006 18:06, Ray Hauge wrote:
> Hello List,
>
> I am trying to find a way for a program to search through the text on a
> PDF. My first thought was to use pdftotext, but the PDFs generated by our
> commercial scanner/copier/printer machine do not seem to work with
> pdftotext... it just outputs two CRLFs. I've been looking around on the
> net for something similar that might work.
>
> Anyone know of something like that?
>
> Thanks,
> --
> Ray Hauge

Things I forgot to post:

It is a PHP script. I was planning on using shell_exec() to call the program and read the output from stdout.

--
Ray Hauge
Programmer/Systems Administrator
American Student Loan Services
www.americanstudentloan.com
1.800.575.1099
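For reference, a minimal sketch of the shell_exec() approach described above. It assumes the pdftotext binary is on the PATH and uses its "-" argument to send the text to stdout. The helper names are hypothetical, and the empty-output check is just one way to notice a PDF with no text layer, which is what the "two CRLFs" symptom suggests.

```php
<?php
// Sketch only: assumes the pdftotext binary is installed and on the PATH.
// The three function names are hypothetical helpers for illustration.

/** Build the shell command; the trailing "-" tells pdftotext to write to stdout. */
function buildPdfToTextCommand(string $pdfPath): string
{
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' -';
}

/** True when the output has no visible characters (only CR, LF, form feeds, spaces). */
function hasNoText($output): bool
{
    return !is_string($output) || preg_match('/\S/', $output) !== 1;
}

/**
 * Search the PDF's text layer for $needle (case-insensitive).
 * Returns null when no text could be extracted, e.g. for a scanned, image-only PDF.
 */
function pdfContains(string $pdfPath, string $needle): ?bool
{
    $output = shell_exec(buildPdfToTextCommand($pdfPath));

    if (hasNoText($output)) {
        return null; // nothing to search; the file probably needs OCR instead
    }

    return stripos($output, $needle) !== false;
}
```

Returning null rather than false keeps "the word is not in this PDF" separate from "this PDF gave us no text at all", which is the case the scanner output falls into.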
[PHP] PDF to Text
Hello List,

I am trying to find a way for a program to search through the text on a PDF. My first thought was to use pdftotext, but the PDFs generated by our commercial scanner/copier/printer machine do not seem to work with pdftotext... it just outputs two CRLFs. I've been looking around on the net for something similar that might work.

Anyone know of something like that?

Thanks,
--
Ray Hauge
Programmer/Systems Administrator
American Student Loan Services
www.americanstudentloan.com
1.800.575.1099