Re: editing pdf files
On Saturday 13 October 2012 21:47:01 Gary Kline wrote:

> SO: Is pdfimages going to spit out 650 files? as noted
> in last email, only a couple of these images are of any interest

Probably. But Gimp accepts PDF files and gives you the option of
importing images of individual selected pages. You might then be able
to extract the text with some OCR software.

--
Mike Clarke
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questions-unsubscr...@freebsd.org"
Re: editing pdf files
On Sat, Oct 13, 2012 at 11:15:36PM +0200, Polytropon wrote:
> On Sat, 13 Oct 2012 13:47:01 -0700, Gary Kline wrote:
> > On Sat, Oct 13, 2012 at 01:19:07PM +0200, Polytropon wrote:
> > > On Fri, 12 Oct 2012 16:46:28 -0700, Gary Kline wrote:
> > >
> > > The disassembling can be done with
> > >
> > >     % pdfimages source.pdf .
> > >
> > > Then the files can be edited with whatever tool you like, e.g. Gimp.
> > > They often come out in PBM format.
> >
> > A qstn I should have asked last time. this book is a history or
> > bio of richland county, ohio:: in type, it's like 650 or more
> > pages. SO: Is pdfimages going to spit out 650 files? as noted
> > in last email, only a couple of these images are of any interest
>
> Depends on what actually _is_ in the PDF file. If every page is
> represented as a picture, 650 pictures will be created. If it
> contains text _and_ images, it will _only_ output the images,
> with no real relation to where they have been placed in the text.
> As suggested by the name "pdfimages", it takes the images from
> the PDF file. :-)
>
> The easiest way to check for possible text is to install xpdf,
> which brings the binary "pdftotext" (if I remember correctly that
> this tool is in _that_ package). You can then use it like this:
>
>     % pdftotext source.pdf
>
> It will create "source.txt" with all actual text (but of course
> without _any_ formatting except line breaks and ^L page breaks),
> including page numbers. But hey, it's pure ASCII text suitable
> for further processing. :-)
>
> Run "pdftotext" without parameters for a short summary of its
> parameters; "man pdftotext" is also provided.

Well, then my original instincts were right. I ran pdftotext and
nothing but the page numbers was there. rats. oh well, at least I
can type in by hand what I want :)

> --
> Polytropon
> Magdeburg, Germany
> Happy FreeBSD user since 4.0
> Andra moi ennepe, Mousa, ...

--
Gary Kline kl...@thought.org http://www.thought.org Public Service Unix
Twenty-six years of service to the Unix community.
Re: editing pdf files
On Sat, 13 Oct 2012 13:38:16 -0700, Gary Kline wrote:
> On Sat, Oct 13, 2012 at 01:19:07PM +0200, Polytropon wrote:
> > On Fri, 12 Oct 2012 16:46:28 -0700, Gary Kline wrote:
> > > I've got a question that fits in here. hopefully.
> > >
> > > last week I found a book from 1901 that google had scanned and listed
> > > as a pdf file. it was text plus photos of the rich/famous of the
> > > 1800s. somehow, google found the exact string that matched my great
> > > grandfather [from the civil war]. I d'loaded the file (maybe 2mbytes)
> > > and searched using acroread. nada. I used the pdftotext utility.
> > > same: nothing but some 600 page numbers.
> > >
> > > my guess is that google just took photos of the book and used other
> > > tools to create a pdf file. I am not =that= serious about genealogy,
> > > but I would like to know if there are any tools to edit this kind of
> > > pdf file.
> >
> > In case the PDF is nothing more than a compilation of images,
> > there's a way to deal with it for editing:
>
> the images in this book aren't what I am interested in.
> just text.

In case the text is "in" images (i.e. the images contain text),
postprocessing those images will be the only way to obtain the
text information (if there is no actual text in the PDF).

> what fmt works best with the ocr suites? or are they about the
> same? for the section I got in that 1901 book on my g-grandfather,
> it was only about 1.5 pages. there was no photo, just his name
> and some bio. Still, things I had no knowledge of. I'm sure
> that my father didn't know either!

It should work with any lossless (!) format, especially if the image
contains only two colors (the black-and-white variants of PBM, GIF
and PNG can store that losslessly; JPEG can't).

In case tesseract OCR does not operate on PBM files directly,
convert them into something it can handle better, like TIFF or
maybe PNG; you can use

    % convert .-530.pbm 530.png
    % convert .-531.pbm 531.png

manually (as you will only process two files) and then run the
OCR process on them.

Note that pdfimages can also output color images (if they are
color images in the source), e.g. I found .-000.ppm (PPM format)
with a diagram in "Good Ideas, Through the Looking Glass" by
N. Wirth. I'm not sure if there could also "directly" be PNG or
EPS files in a PDF file...

--
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
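
The convert-then-OCR step described in this message could be wrapped in a small shell function, roughly as follows. This is only a sketch: it assumes ImageMagick's convert and the tesseract binary are installed, and that the images were extracted with an image root of "page" (so they are named page-530.pbm and so on) rather than the "." root used in the thread. The function name ocr_pages and the numbers are made up for illustration; note that pdfimages numbers images, not pages, so the right numbers have to be found by looking at the files.

```shell
#!/bin/sh
# Sketch: convert selected PBM images to PNG and run tesseract on each.
# "ocr_pages" and the "page" image root are illustrative assumptions.

ocr_pages() {
    prefix=$1
    shift
    for n in "$@"; do
        src="${prefix}-${n}.pbm"
        if [ ! -f "$src" ]; then
            echo "missing: $src" >&2
            continue
        fi
        # PNG is lossless, so the two-colour scan survives the conversion.
        convert "$src" "${prefix}-${n}.png" || return 1
        # tesseract appends ".txt" to the output base name by itself.
        tesseract "${prefix}-${n}.png" "${prefix}-${n}" || return 1
    done
}

# Example: ocr_pages page 530 531
```

After a successful run, the recognized text would be in page-530.txt and page-531.txt; the quality still depends on the scan and on the tesseract version.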
Re: editing pdf files
On Sat, 13 Oct 2012 13:47:01 -0700, Gary Kline wrote:
> On Sat, Oct 13, 2012 at 01:19:07PM +0200, Polytropon wrote:
> > On Fri, 12 Oct 2012 16:46:28 -0700, Gary Kline wrote:
> >
> > The disassembling can be done with
> >
> >     % pdfimages source.pdf .
> >
> > Then the files can be edited with whatever tool you like, e.g. Gimp.
> > They often come out in PBM format.
>
> A qstn I should have asked last time. this book is a history or
> bio of richland county, ohio:: in type, it's like 650 or more
> pages. SO: Is pdfimages going to spit out 650 files? as noted
> in last email, only a couple of these images are of any interest

Depends on what actually _is_ in the PDF file. If every page is
represented as a picture, 650 pictures will be created. If it
contains text _and_ images, it will _only_ output the images,
with no real relation to where they have been placed in the text.
As suggested by the name "pdfimages", it takes the images from
the PDF file. :-)

The easiest way to check for possible text is to install xpdf,
which brings the binary "pdftotext" (if I remember correctly that
this tool is in _that_ package). You can then use it like this:

    % pdftotext source.pdf

It will create "source.txt" with all actual text (but of course
without _any_ formatting except line breaks and ^L page breaks),
including page numbers. But hey, it's pure ASCII text suitable
for further processing. :-)

Run "pdftotext" without parameters for a short summary of its
parameters; "man pdftotext" is also provided.

--
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
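
If only a few pages matter, the 650-file problem can probably be avoided by limiting pdfimages to a page range with its -f (first page) and -l (last page) options. The sketch below wraps that in a function; the name extract_range, the default image root "page" and the page numbers in the example are placeholders, not something from the thread.

```shell
#!/bin/sh
# Sketch: extract images only from pages FIRST..LAST of a PDF.
# "extract_range" and the default root "page" are illustrative assumptions.

extract_range() {
    pdf=$1
    first=$2
    last=$3
    root=${4:-page}
    # -f / -l restrict pdfimages to the given first and last page.
    pdfimages -f "$first" -l "$last" "$pdf" "$root"
}

# Example: extract_range source.pdf 580 582
```

With a scanned book where each page is one image, this should leave only a handful of files (page-000.pbm, page-001.pbm, ...) to look through.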
Re: editing pdf files
On Sat, Oct 13, 2012 at 01:19:07PM +0200, Polytropon wrote:
> On Fri, 12 Oct 2012 16:46:28 -0700, Gary Kline wrote:
>
> The disassembling can be done with
>
>     % pdfimages source.pdf .
>
> Then the files can be edited with whatever tool you like, e.g. Gimp.
> They often come out in PBM format.

A qstn I should have asked last time. this book is a history or
bio of richland county, ohio:: in type, it's like 650 or more
pages. SO: Is pdfimages going to spit out 650 files? as noted
in last email, only a couple of these images are of any interest

--
Gary Kline kl...@thought.org http://www.thought.org Public Service Unix
Twenty-six years of service to the Unix community.
Re: editing pdf files
On Sat, Oct 13, 2012 at 01:19:07PM +0200, Polytropon wrote:
> On Fri, 12 Oct 2012 16:46:28 -0700, Gary Kline wrote:
> > I've got a question that fits in here. hopefully.
> >
> > last week I found a book from 1901 that google had scanned and listed
> > as a pdf file. it was text plus photos of the rich/famous of the
> > 1800s. somehow, google found the exact string that matched my great
> > grandfather [from the civil war]. I d'loaded the file (maybe 2mbytes)
> > and searched using acroread. nada. I used the pdftotext utility.
> > same: nothing but some 600 page numbers.
> >
> > my guess is that google just took photos of the book and used other
> > tools to create a pdf file. I am not =that= serious about genealogy,
> > but I would like to know if there are any tools to edit this kind of
> > pdf file.
>
> In case the PDF is nothing more than a compilation of images,
> there's a way to deal with it for editing:

the images in this book aren't what I am interested in. just text.

> step 1: disassemble
> step 2: edit images
> step 3: reassemble
>
> The disassembling can be done with
>
>     % pdfimages source.pdf .
>
> Then the files can be edited with whatever tool you like, e.g. Gimp.
> They often come out in PBM format.
>
> Finally the images can be re-converted to PDF and combined to one
> PDF file:
>
>     for IMG in .*.pbm; do
>         convert ${IMG} ${IMG}.pdf
>     done
>     pdftk .*.pdf output target.pdf
>
> Note the ".*" prefix for the file specification: The images extracted
> by pdfimages match that pattern (at least in the case I tested it for).
> If they get other names than .001.pbm, change the approach
> accordingly.

turns out that the first roughly 580 pages are of no interest.
I'll see if tesseract-ocr can get rid of most of the data.

what fmt works best with the ocr suites? or are they about the
same? for the section I got in that 1901 book on my g-grandfather,
it was only about 1.5 pages. there was no photo, just his name
and some bio. Still, things I had no knowledge of. I'm sure that
my father didn't know either!

gary

> --
> Polytropon
> Magdeburg, Germany
> Happy FreeBSD user since 4.0
> Andra moi ennepe, Mousa, ...

--
Gary Kline kl...@thought.org http://www.thought.org Public Service Unix
Twenty-six years of service to the Unix community.
Re: editing pdf files
On Sat, Oct 13, 2012 at 04:40:23AM +0200, C. P. Ghost wrote:
> On Sat, Oct 13, 2012 at 1:46 AM, Gary Kline wrote:
> > On Fri, Oct 12, 2012 at 10:40:29PM +0400, Boris Samorodov wrote:
> > > 10.10.2012 02:35, Gary Aitken wrote:
> > >
> > > > Can someone give me advice on editing pdf files?
> > >
> > > Take a look at graphics/inkscape.
> > >
> > > --
> > > WBR, Boris Samorodov (bsam)
> > > FreeBSD Committer, http://www.FreeBSD.org The Power To Serve
> >
> > I've got a question that fits in here. hopefully.
> >
> > last week I found a book from 1901 that google had scanned and listed
> > as a pdf file. it was text plus photos of the rich/famous of the
> > 1800s. somehow, google found the exact string that matched my great
> > grandfather [from the civil war]. I d'loaded the file (maybe 2mbytes)
> > and searched using acroread. nada. I used the pdftotext utility.
> > same: nothing but some 600 page numbers.
> >
> > my guess is that google just took photos of the book and used other
> > tools to create a pdf file. I am not =that= serious about genealogy,
> > but I would like to know if there are any tools to edit this kind of
> > pdf file.
>
> I suspect the following: they scanned the book and put all the images
> into the PDF. The PDF itself is merely a container for scanned pages;
> it thus contains no text (save for the page numbers).
>
> That Google was able to search in this file is probably due to them running
> some OCR program on the image files, and then indexing the (approximate)
> text that the OCR program generated. Probably they used something like
> tesseract-ocr from ports graphics/tesseract:
> http://code.google.com/p/tesseract-ocr/

in more recent google stuff--text--sci-tech zines or whatever--it
seems like they have used some very high-end ocr programs and
=then= turned the file into pdf. I have been able to get very good
textfiles from a small sample of google's work. a few years ago I
tried the ocr ports we have. very poor results. it may be time to
see if the newer versions give me better results.

gary

ps: tesseract was one I tried [circa '10] ... time to look at the
actual Code!

> -cpghost.
>
> --
> Cordula's Web. http://www.cordula.ws/

--
Gary Kline kl...@thought.org http://www.thought.org Public Service Unix
Twenty-six years of service to the Unix community.
Re: editing pdf files
On Fri, 12 Oct 2012 16:46:28 -0700, Gary Kline wrote:
> I've got a question that fits in here. hopefully.
>
> last week I found a book from 1901 that google had scanned and listed
> as a pdf file. it was text plus photos of the rich/famous of the
> 1800s. somehow, google found the exact string that matched my great
> grandfather [from the civil war]. I d'loaded the file (maybe 2mbytes)
> and searched using acroread. nada. I used the pdftotext utility.
> same: nothing but some 600 page numbers.
>
> my guess is that google just took photos of the book and used other
> tools to create a pdf file. I am not =that= serious about genealogy,
> but I would like to know if there are any tools to edit this kind of
> pdf file.

In case the PDF is nothing more than a compilation of images,
there's a way to deal with it for editing:

    step 1: disassemble
    step 2: edit images
    step 3: reassemble

The disassembling can be done with

    % pdfimages source.pdf .

Then the files can be edited with whatever tool you like, e.g. Gimp.
They often come out in PBM format.

Finally the images can be re-converted to PDF and combined to one
PDF file:

    for IMG in .*.pbm; do
        # one single-page PDF per extracted image
        convert "${IMG}" "${IMG}.pdf"
    done
    # the shell expands the glob in sorted order, which keeps page order
    pdftk .*.pdf output target.pdf

Note the ".*" prefix for the file specification: The images extracted
by pdfimages match that pattern (at least in the case I tested it for).
If they get other names than .001.pbm, change the approach
accordingly.

--
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
Re: editing pdf files
On Sat, Oct 13, 2012 at 1:46 AM, Gary Kline wrote:
> On Fri, Oct 12, 2012 at 10:40:29PM +0400, Boris Samorodov wrote:
> > 10.10.2012 02:35, Gary Aitken wrote:
> >
> > > Can someone give me advice on editing pdf files?
> >
> > Take a look at graphics/inkscape.
> >
> > --
> > WBR, Boris Samorodov (bsam)
> > FreeBSD Committer, http://www.FreeBSD.org The Power To Serve
>
> I've got a question that fits in here. hopefully.
>
> last week I found a book from 1901 that google had scanned and listed
> as a pdf file. it was text plus photos of the rich/famous of the
> 1800s. somehow, google found the exact string that matched my great
> grandfather [from the civil war]. I d'loaded the file (maybe 2mbytes)
> and searched using acroread. nada. I used the pdftotext utility.
> same: nothing but some 600 page numbers.
>
> my guess is that google just took photos of the book and used other
> tools to create a pdf file. I am not =that= serious about genealogy,
> but I would like to know if there are any tools to edit this kind of
> pdf file.

I suspect the following: they scanned the book and put all the images
into the PDF. The PDF itself is merely a container for scanned pages;
it thus contains no text (save for the page numbers).

That Google was able to search in this file is probably due to them
running some OCR program on the image files, and then indexing the
(approximate) text that the OCR program generated. Probably they used
something like tesseract-ocr from ports graphics/tesseract:
http://code.google.com/p/tesseract-ocr/

> tia guys,
>
> gary
>
> --
> Gary Kline kl...@thought.org http://www.thought.org Public Service Unix
> Twenty-six years of service to the Unix community.

-cpghost.

--
Cordula's Web. http://www.cordula.ws/
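
Whether a PDF is such an image-only container can be tested from the shell before any OCR is attempted. The following is a rough sketch, assuming pdftotext is installed: it writes the extracted text to standard output (output file "-"), throws away whitespace and digits so that bare page numbers do not count, and checks how much is left. The function name has_text_layer and the default threshold of 100 characters are arbitrary choices for illustration.

```shell
#!/bin/sh
# Sketch: guess whether a PDF carries real text or only scanned images.
# "has_text_layer" and the threshold of 100 are illustrative assumptions.

has_text_layer() {
    # "-" as the output file makes pdftotext write to standard output.
    # Whitespace and digits are dropped so page numbers alone do not count.
    chars=$(pdftotext "$1" - 2>/dev/null \
        | tr -d '[:space:][:digit:]' | wc -c | tr -d ' ')
    [ "$chars" -gt "${2:-100}" ]
}

# Example:
#   if has_text_layer source.pdf; then
#       echo "text found, pdftotext is enough"
#   else
#       echo "looks like scanned images, OCR needed"
#   fi
```

For the 1901 book described in this thread, where pdftotext produced nothing but page numbers, a check like this would be expected to take the "OCR needed" branch.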
Re: editing pdf files
On Fri, Oct 12, 2012 at 10:40:29PM +0400, Boris Samorodov wrote:
> 10.10.2012 02:35, Gary Aitken wrote:
>
> > Can someone give me advice on editing pdf files?
>
> Take a look at graphics/inkscape.
>
> --
> WBR, Boris Samorodov (bsam)
> FreeBSD Committer, http://www.FreeBSD.org The Power To Serve

I've got a question that fits in here. hopefully.

last week I found a book from 1901 that google had scanned and listed
as a pdf file. it was text plus photos of the rich/famous of the
1800s. somehow, google found the exact string that matched my great
grandfather [from the civil war]. I d'loaded the file (maybe 2mbytes)
and searched using acroread. nada. I used the pdftotext utility.
same: nothing but some 600 page numbers.

my guess is that google just took photos of the book and used other
tools to create a pdf file. I am not =that= serious about genealogy,
but I would like to know if there are any tools to edit this kind of
pdf file.

tia guys,

gary

--
Gary Kline kl...@thought.org http://www.thought.org Public Service Unix
Twenty-six years of service to the Unix community.
Re: editing pdf files
10.10.2012 02:35, Gary Aitken wrote:

> Can someone give me advice on editing pdf files?

Take a look at graphics/inkscape.

--
WBR, Boris Samorodov (bsam)
FreeBSD Committer, http://www.FreeBSD.org The Power To Serve
Re: editing pdf files
On 9 October 2012 18:35, Gary Aitken wrote:

> Can someone give me advice on editing pdf files?
> I have some forms I'm trying to fill out that contain labels followed by
> a bunch of underline characters and all I need to do is delete some of the
> underlines and add text.
>
> I tried using pdfedit but the modified file doesn't display properly. I
> assumed it would take care of text placement, etc., but it did not
> produce readable results. For example, removing underline characters and
> replacing them with "340007174" only displays "3400014"; "75-300mm"
> displays as "-300mm". Am I missing something or does it just not work?
>
> Does anyone have any experience with the open office oracle pdf import
> extension? I don't see it in the ports collection, and I am not sure
> whether to even try using the linux version.

That a file is a .pdf doesn't tell you much about it, actually. If the
.pdf is just an image (e.g. a scanned document) you can perhaps import it
into gimp, or export it as a .png. You might be able to OCR it, I don't
know.

If it's a proper text document saved as a .pdf, you can surely import it
into open- or libre-office & edit it from there. You might also be able
to export it as a .ps (using perhaps print/gv or graphics/xpdf, or perhaps
one of the tools from graphics/poppler) & edit it with an even wider
variety of tools.

Yet another option would be to use google docs (or similar), which
generally allows viewing & perhaps editing as an HTML document. I don't
know if google docs allows re-export back to .pdf, though.

--
--
editing pdf files
Can someone give me advice on editing pdf files?
I have some forms I'm trying to fill out that contain labels followed by
a bunch of underline characters and all I need to do is delete some of the
underlines and add text.

I tried using pdfedit but the modified file doesn't display properly. I
assumed it would take care of text placement, etc., but it did not produce
readable results. For example, removing underline characters and replacing
them with "340007174" only displays "3400014"; "75-300mm" displays as
"-300mm". Am I missing something or does it just not work?

Does anyone have any experience with the open office oracle pdf import
extension? I don't see it in the ports collection, and I am not sure
whether to even try using the linux version.

Thanks,
Gary