Re: editing pdf files

Gary Kline Sat, 13 Oct 2012 13:45:09 -0700

On Sat, Oct 13, 2012 at 01:19:07PM +0200, Polytropon wrote:
> On Fri, 12 Oct 2012 16:46:28 -0700, Gary Kline wrote:
> >     I've got a question that fits in here.  hopefully.
> > 
> >     last week I found a book from 1901 that google had scanned and listed
> >     as a pdf file.  it was text plus photos of the rich/famous of the
> >     1800s.  somehow, google found the exact string that matched my great
> >     grandfather [from the civil war].  I downloaded the file (maybe 2 MB)
> >     and searched using acroread.  nada.  I used the pdftotext utility.
> >     same: nothing but some 600 page numbers.
> > 
> >     my guess is that google just took photos of the book and used other
> >     tools to create a pdf file.  I am not =that= serious about genealogy,
> >     but I would like to know if there are any tools to edit this kind of
> >     pdf file.
> 
> In case the PDF is nothing more than a compilation of images,
> there's a way to deal with it for editing:

        the images in this book aren't what I am interested in,
        just the text.

> 
> step 1: disassemble
> step 2: edit images
> step 3: reassemble
> 
> The disassembling can be done with 
> 
>       % pdfimages source.pdf .
> 
> Then the files can be edited with whatever tool you like, e.g. Gimp.
> They often come out in PBM format.
> 
> Finally the images can be re-converted to PDF and combined to one
> PDF file:
> 
>       for IMG in .*.pbm; do
>               convert "${IMG}" "${IMG}.pdf"
>       done
>       pdftk .*.pdf cat output target.pdf
> 
> Note the ".*" prefix for the file specification: The images extracted
> by pdfimages match that pattern (at least in the case I tested it for).
> If they get other names than .0000001.pbm, change the approach
> accordingly.
> 

        turns out that the first roughly 580 pages are of no interest.
        I'll see if tesseract-ocr can get rid of most of the data.
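
        a rough sketch of how skipping those pages might look, not a
        tested recipe.  it assumes pdfimages accepts -f/-l to limit the
        page range and that tesseract is called as "tesseract image
        outputbase".  the names ocr_range, run, DRY_RUN and the "page"
        prefix are made up for illustration.

```shell
# Sketch: OCR only a page range of an image-only PDF.
# Assumes pdfimages (xpdf/poppler) and tesseract are installed.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: ocr_range source.pdf first_page last_page [image_prefix]
ocr_range() {
    src=$1
    first=$2
    last=$3
    root=${4:-page}

    # -f / -l restrict extraction to the pages of interest
    run pdfimages -f "$first" -l "$last" "$src" "$root" || return 1

    # pdfimages usually writes PBM/PGM/PPM files named <prefix>-NNN.<ext>
    for img in "$root"-*.p[bgp]m; do
        [ -e "$img" ] || continue   # the glob matched nothing
        # tesseract appends .txt to the output base name
        run tesseract "$img" "${img%.*}"
    done
}
```

        something like "ocr_range book.pdf 581 600" (book.pdf being a
        placeholder name) should then leave one .txt file per extracted
        page image.  running it with DRY_RUN=1 first shows what would be
        executed.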

        what format works best with the ocr suites?  or are they about the
        same?  the section I got in that 1901 book on my g-grandfather
        was only about 1.5 pages.  there was no photo, just his name
        and some bio.  still, things I had no knowledge of.  I'm sure
        that my father didn't know either!

        gary

> 
> 
> -- 
> Polytropon
> Magdeburg, Germany
> Happy FreeBSD user since 4.0
> Andra moi ennepe, Mousa, ...

-- 
 Gary Kline  [email protected]  http://www.thought.org  Public Service Unix
              Twenty-six years of service to the Unix community.

_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[email protected]"
