What I did to set up for the conversion. Note: I'm doing this on a CentOS 5.x system.
1. Add RpmForge to the YUM repo file
        wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
        rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt
        rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm
        rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm
2. Install tesseract
        yum -y install tesseract tesseract-en
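
Before going further it may be worth confirming that both tools ended up on the PATH. A minimal sketch; missing_tools is a made-up helper name, not part of either package:

```shell
#!/bin/bash
# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools gs tesseract)
if [ -n "$missing" ]; then
    echo "Not installed or not on PATH: $missing"
else
    echo "gs and tesseract are both available"
fi
```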

This makes it possible to do the following:
        gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE <Input.pdf>
    where the options are
            -r is the resolution in dpi
            -sDEVICE=tiffgray for grayscale TIFF output
            -sOutputFile=outputfilename
                    note %02d causes the page number to be inserted into the filename
            -dBATCH -dNOPAUSE make gs run without prompting and exit when done
Followed by
        tesseract inputFile outputfile -l eng
    where the options are
            inputFile is one of the tif files output by gs
            outputfile will be given a .txt extension
            -l is the language of the input file (eng = English)
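
Since tesseract takes one page at a time, the step is normally wrapped in a loop. A rough sketch, assuming the tif files were written as ocr_NN.tif by the gs command above and sit in the current directory; ocr_pages is a made-up function name:

```shell
#!/bin/bash
# OCR every ocr_*.tif in the current directory.
# tesseract appends .txt to the output base name itself,
# so ocr_01.tif becomes ocr_01.txt.
ocr_pages() {
    local tif
    for tif in ocr_*.tif; do
        [ -e "$tif" ] || continue      # the glob matched nothing
        tesseract "$tif" "${tif%.tif}" -l eng
    done
}
```

Call ocr_pages from the directory that holds the tif files.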
And then put the pages back together with
        cat tess-outfile01.txt tess-outfile02.txt ... tess-outfilenn.txt > Input.txt

There will be some failed conversions and bad guesses by the tesseract program, so check the final output for correctness.
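
Part of that check can be done before the pages are joined. The sketch below flags page files that came out with no visible text and then joins the pages with a glob instead of listing each file. It assumes the page files are named ocr_NN.txt; blank_pages and join_pages are made-up helper names. It only catches empty pages, not bad guesses.

```shell
#!/bin/bash
# Print the names of page files that hold no visible text.
blank_pages() {
    local f
    for f in "$@"; do
        grep -q '[^[:space:]]' "$f" || echo "$f"
    done
}

# Join the page files into one document ($1 = output file).
# The shell expands the glob in sorted order, so the zero-padded
# names come out in page order.
join_pages() {
    cat ocr_[0-9][0-9].txt > "$1"
}
```

For example: run blank_pages ocr_*.txt, look at any page it lists, then run join_pages Input.txt.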

Bash Script to do the conversion
< This got reformatted and I attempted to put it back the way I remembered it.>
< the tesseract step takes a while on each page>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
#!/bin/bash
#
# Use this script to convert a PDF-formatted file to text.
# The input file will be split into single-page TIFF files,
# which will be run through tesseract to OCR the files into
# text files. The text files will be reassembled into a
# single text file.
#
# NOTE: There will still be some cleanup of the text files,
# as the OCR is not perfect.
#
# Get input file name and final output filename
InFile=${1:-"infile.pdf"}
TIFFile="${InFile%.pdf}"
OutFile=${2:-"$TIFFile.txt"}
echo "Input from $InFile, OCR output to $OutFile"
if [ ! -e "$InFile" ] ; then
    echo "$InFile not found. exiting"
    exit 1
elif [ ! -r "$InFile" ] ; then
    echo " Read not allowed on $InFile. exiting"
    exit 1
fi
# set up a temp working area
WrkDir="/tmp/$(date +%s)"
mkdir "$WrkDir" || exit 1
echo " Working Dir = $WrkDir"
cp "$InFile" "$WrkDir/"
Hdir=$(pwd)
cd "$WrkDir" || exit 1
# pwd
gs -r300x300 -sDEVICE=tiffgray -sOutputFile="$TIFFile%02d.tif" -dBATCH -dNOPAUSE "$InFile" >files
# gs reports each page it processes; count those lines
TifCount=$(grep -c "Page " files)
rm files
#
ls -l *.tif
echo "number of pages to process = $TifCount"
for wtif in *.tif; do
    wtxt=${wtif%.tif}
    tesseract "$wtif" "$wtxt" -l eng
done
#
ls -l *.txt
# list the page files before the output file is created
TxtFiles=$( ls *.txt )
touch "$OutFile"
for Tf in $TxtFiles; do
    echo "Working on $Tf, "
    cat "$Tf" >> "$OutFile"
done
ls -l
cp "$OutFile" "$Hdir/"
cd "$Hdir"
# once debugged enable the following
rm -fr "$WrkDir"
exit 0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
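
The script names its working directory after the clock (/tmp/$(date +%s)), so two runs in the same second would share a directory. One possible alternative, which is not what the script above does, is mktemp -d together with a trap that cleans up on exit. A minimal sketch, assuming mktemp is installed:

```shell
#!/bin/bash
# Create a private scratch directory with a unique name.
WrkDir=$(mktemp -d) || exit 1
# Remove it when the script exits, even after an error.
trap 'rm -rf "$WrkDir"' EXIT
echo " Working Dir = $WrkDir"
```

With the trap in place the final rm -fr line would no longer be needed.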

James C

Attachment: Pdf2txt.sh
Description: Binary data

Pdf to Text

The bash script Pdf2txt.sh, located in the same directory as this file,
will do all of the following steps for PDFs up to 99 pages. It is also
on test.sb.state.az.us (10.168.30.100) in /home/jimc.
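
The 99-page figure presumably comes from the two-digit %02d page number: a page 100 would still be written, but its name would no longer sort into page order when the files are listed. A small illustration using printf, which takes the same format syntax; the %03d variant is only a suggestion and is not what Pdf2txt.sh uses:

```shell
#!/bin/bash
# Two-digit padding: page 100 sorts between pages 10 and 11.
printf 'ocr_%02d.txt\n' 9 10 11 100 | LC_ALL=C sort

# Three-digit padding (hypothetical change): names stay in
# page order up to 999 pages.
printf 'ocr_%03d.txt\n' 9 10 11 100 | LC_ALL=C sort
```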


Convert the .pdf document to single-page .tif documents
>gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE <Input.pdf>
        -r is the resolution in dpi
        -sDEVICE=tiffgray for grayscale TIFF output
        -sOutputFile=outputfilename
                note %02d causes the page number to be inserted
                into the filename

Next, use tesseract to convert each page to text
>tesseract inputFile outputfile -l eng
        inputFile is one of the tif files output by gs
        outputfile will be given a .txt extension
        -l is the language of the input file (eng = English)

Reassemble the OCR'ed .txt files into a single document
>cat tess-outfile01.txt tess-outfile02.txt ... tess-outfilenn.txt > Input.txt


On the test server <10.168.30.100 test.sb.state.az.us>
I have installed tesseract using the following
        yum -y install tesseract tesseract-en
using the Rpmforge repository
  wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
  rpm --import  http://apt.sw.be/RPM-GPG-KEY.dag.txt
  rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm
  rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm


---------------------------------------------------
PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
