What I did to set up for the conversion. Note: I'm doing this on a CentOS 5.x system.

1. Add RPMforge to the YUM repos:

    wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
    rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt
    rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm
    rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm

2. Install tesseract:

    yum -y install tesseract tesseract-en

This makes it possible to do the following:

    gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE Input.pdf
where the options are:

    -r            is the resolution (300x300 dpi here)
    -sDEVICE      selects the output device; tiffgray gives grayscale TIFF output
    -sOutputFile  is the output file name; note that %02d causes the page number to be inserted into the file name
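To see what names the %02d pattern produces, here is a minimal sketch using the shell's printf, which pads numbers the same way. The page_name helper is made up for illustration, and the ocr_ prefix is just the example name from the gs command above.

```shell
#!/bin/bash
# Print the file name gs would write for a given page number,
# assuming -sOutputFile=ocr_%02d.tif as in the command above.
# page_name is a hypothetical helper, not part of gs.
page_name() {
    printf 'ocr_%02d.tif\n' "$1"
}

page_name 1     # ocr_01.tif
page_name 12    # ocr_12.tif
```

The zero padding matters later on, because it keeps the page files in page order when the shell lists them.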
Followed by

    tesseract inputFile outputfile -l eng

where the options are:

    inputFile   is one of the TIFF files output by gs
    outputfile  is the output base name; it will be given a .txt extension
    -l          is the language of the input file (eng for English)

And then put the pages back together with:

    cat tess-outfile01.txt tess-outfile02.txt ... tess-outfilenn.txt > Input.txt
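Rather than typing every page name by hand, the two steps above can be looped. This is only a sketch: the function names ocr_pages and join_pages are mine, and it assumes the page images are named ocr_NN.tif as in the gs command, so the text files come out as ocr_NN.txt instead of tess-outfileNN.txt.

```shell
#!/bin/bash
# OCR every page image in the current directory, one text file per page.
# tesseract appends .txt to the output base name itself.
ocr_pages() {
    local tif
    for tif in ocr_*.tif; do
        [ -e "$tif" ] || continue    # no matches: the glob stays literal
        tesseract "$tif" "${tif%.tif}" -l eng
    done
}

# Join the per-page text files into the file named by $1. The shell
# expands the glob in sorted order, so zero-padded names come out in
# page order.
join_pages() {
    cat ocr_*.txt > "$1"
}
```

Usage would be something like `ocr_pages && join_pages Input.txt`, run from the directory holding the .tif files. The output name should not itself match ocr_*.txt, or cat would be reading its own output file.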
There will be some failed conversions and bad guesses by the tesseract program, so check the final output for correctness.
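One cheap first check before reading the whole thing is to look for pages where the OCR produced nothing at all. The sketch below is a hypothetical helper (empty_pages is not part of tesseract); it only finds blank pages and will not catch misread words, so the output still needs to be read.

```shell
#!/bin/bash
# Print the name of each given text file that contains no non-blank
# characters. Such pages are worth looking at by hand.
empty_pages() {
    local txt
    for txt in "$@"; do
        if ! grep -q '[^[:space:]]' "$txt"; then
            echo "$txt"
        fi
    done
}
```

For example, `empty_pages ocr_*.txt` run before the pages are joined, assuming the per-page files are named ocr_NN.txt.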
Bash script to do the conversion
<This got reformatted and I attempted to put it back the way I remembered it.>
<The tesseract step takes a while on each page.>

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
#!/bin/bash
# # # # # #
# Use this script to convert a PDF formatted file to text.
# The input file will be split into single-page TIFF files,
# which will be run through tesseract to OCR the files into
# text files. The text files will be reassembled into a
# single text file.
#
# NOTE: There will still be some cleanup of the text files
#       as the OCR is not perfect.
# NOTE: The input file is expected to be in the current
#       directory (a name without a path component).
#
# # # # # #

# Get input file name and final output file name
InFile=${1:-"infile.pdf"}
TIFFile="${InFile%.pdf}"
OutFile=${2:-"$TIFFile.txt"}
echo "Input from $InFile, OCR output to $OutFile"

if [ ! -e "$InFile" ] ; then
    echo "$InFile not found. exiting"
    exit 1
elif [ ! -r "$InFile" ] ; then
    echo " Read not allowed on $InFile. exiting"
    exit 1
fi

# Set up a temp working area
WrkDir="/tmp/$(date +%s)"
mkdir "$WrkDir"
echo " Working Dir = $WrkDir"
cp "$InFile" "$WrkDir/"
Hdir=$(pwd)
cd "$WrkDir"
# pwd

# Split the PDF into one TIFF per page; gs prints a "Page N" line per page
gs -r300x300 -sDEVICE=tiffgray -sOutputFile="$TIFFile%02d.tif" -dBATCH -dNOPAUSE "$InFile" >files
TifCount=$(grep -c "Page " files)
rm files
# ls -l *.tif
echo "number of pages to process = $TifCount"

# OCR each page; tesseract adds the .txt extension itself
for wtif in *.tif; do
    wtxt=${wtif%.tif}
    tesseract "$wtif" "$wtxt" -l eng
done
# ls -l *.txt

# List the page files before creating $OutFile so it is not appended to itself
TxtFiles=$( ls *.txt )
touch "$OutFile"
for Tf in $TxtFiles; do
    # echo "Working on $Tf, "
    cat "$Tf" >> "$OutFile"
done
ls -l
cp "$OutFile" "$Hdir/"
cd "$Hdir"

# once debugged enable the following
rm -fr "$WrkDir"
exit 0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

James C
Pdf2txt.sh
Pdf to Text

The bash script Pdf2txt.sh, located in the same directory as this file, will do all the following steps for PDFs up to 99 pages. It is also on test.sb.state.az.us (10.168.30.100) in /home/jimc.

Convert the .pdf document to single-page .tif format documents:

    gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE Input.pdf

    -r            is the resolution
    -sDEVICE      selects the output device; tiffgray gives grayscale TIFF output
    -sOutputFile  is the output file name; note that %02d causes the page number to be inserted into the file name

Next, use tesseract to convert each page to text:

    tesseract inputFile outputfile -l eng

    inputFile   is one of the TIFF files output by gs
    outputfile  is the output base name; it will be given a .txt extension
    -l          is the language of the input file (eng for English)

Reassemble the OCR'ed .txt files into a single document:

    cat tess-outfile01.txt tess-outfile02.txt ... tess-outfilenn.txt > Input.txt

On the test server <10.168.30.100 test.sb.state.az.us> I have installed tesseract using the following:

    yum -y install tesseract tesseract-en

using the RPMforge repository:

    wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
    rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt
    rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm
    rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm
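The 99-page limit presumably comes from the two-digit %02d in the output file name: a three-digit page number no longer sorts into page order when the files are listed. The sketch below shows the effect with made-up page numbers; sorted_names is a hypothetical helper, and switching to %03d is a suggestion of mine, not something the script does.

```shell
#!/bin/bash
# Print the names a printf-style pattern gives for pages 9, 10, 99 and
# 100, in the order a plain byte-wise sort would list them.
sorted_names() {
    local n
    for n in 9 10 99 100; do
        printf "$1\n" "$n"
    done | LC_ALL=C sort
}

sorted_names 'ocr_%02d.tif'    # ocr_100.tif lands before ocr_99.tif
sorted_names 'ocr_%03d.tif'    # names stay in page order
```

With %03d the same approach should hold up to 999 pages, as long as the pattern is changed everywhere the file names are used.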
--------------------------------------------------- PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us To subscribe, unsubscribe, or to change your mail settings: http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss