On Wed, Nov 22, 2017 at 5:39 PM, Jon Ribbens <jon+use...@unequivocal.eu> wrote:
> On 2017-11-21, Daniel Gross <gross...@gmail.com> wrote:
> > I am new to python and jumped right into trying to read out (english)
> > text from PDF files.
>
> That's not a trivial task. However I just released pycpdf, which might
> help you out. Check out https://github.com/jribbens/pycpdf which shows
> an example of extracting text from PDFs. It may or may not cope with
> the particular PDFs you have, as there's quite a lot of variety within
> the format.
>
> Example:
>
>     pdf = pycpdf.PDF(open("file.pdf", "rb").read())
>     if pdf.info and pdf.info.get('Title'):
>         print('Title:', pdf.info['Title'])
>     for pageno, page in enumerate(pdf.pages):
>         print('Page', pageno + 1)
>         print(page.text)
> --
> https://mail.python.org/mailman/listinfo/python-list

Sorry if I'm late to this party, but I use pdf2txt for this. Works just
fine. It has options for different encodings, page range, etc. On Linux
just "apt install python-pdfminer" to install.

--
**** Listen to my FREE CD at http://www.mellowood.ca/music/cedars ****
Bob van der Poel ** Wynndel, British Columbia, CANADA **
EMAIL: b...@mellowood.ca
WWW:   http://www.mellowood.ca
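For anyone who wants to drive pdf2txt from a script rather than the shell, here is a rough sketch of how the encoding and page-range options mentioned above might be passed. The helper names (build_pdf2txt_command, extract_text) are made up for this example and are not part of pdfminer. I am assuming the -p (comma-separated page numbers, counted from 1), -c (output codec) and -o (output file) flags as described in pdfminer's pdf2txt documentation, and that the command is installed as "pdf2txt"; on some installs it is "pdf2txt.py", so check "pdf2txt -h" on your system first.

```python
import subprocess


def build_pdf2txt_command(pdf_path, pages=None, codec=None, output=None,
                          executable="pdf2txt"):
    """Build the argument list for a pdf2txt call (hypothetical helper).

    Assumed flags, per pdfminer's pdf2txt documentation:
      -p  comma-separated page numbers, starting at 1
      -c  output text encoding
      -o  output file (pdf2txt writes to stdout when omitted)
    """
    cmd = [executable]
    if pages:
        cmd += ["-p", ",".join(str(p) for p in pages)]
    if codec:
        cmd += ["-c", codec]
    if output:
        cmd += ["-o", output]
    cmd.append(pdf_path)
    return cmd


def extract_text(pdf_path, pages=None, codec="utf-8", executable="pdf2txt"):
    """Run pdf2txt and return the text it prints to stdout.

    Raises subprocess.CalledProcessError if pdf2txt exits with an error,
    and FileNotFoundError if the executable is not on PATH.
    """
    cmd = build_pdf2txt_command(pdf_path, pages=pages, codec=codec,
                                executable=executable)
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    # Decode with the same codec that was requested via -c.
    return result.stdout.decode(codec)
```

Something like extract_text("file.pdf", pages=[1, 2, 3]) would then give the text of the first three pages as a string. Building the command as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.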