Read and extract text from pdf

2006-04-21 Thread Julien ARNOUX
Hi,
I have a problem :) I just want to extract text from a PDF file with
Python. There are different libraries for that, but none of them works
for me...

I tried pyPdf and pdfTools; I don't know why, but they don't work with
some PDFs. For example, space characters are deleted from the text.
Pdf playground: I don't understand how it works.

If you have an idea, a tutorial, a library, or anything else that could
help me do that, please let me know.

-- 
http://mail.python.org/mailman/listinfo/python-list


Read and extract text from pdf

2006-04-24 Thread Julien ARNOUX
Hi,
Thanks, I used that and it works fine :)

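import subprocess
# ps2ascii ships with Ghostscript; getoutput returns its output as one string
# (the old "commands" module no longer exists in Python 3)
txt = subprocess.getoutput('ps2ascii tmp.pdf')
print(txt)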




Re: Read and extract text from pdf

2006-04-21 Thread Rene Pijlman
Julien ARNOUX:
>I have a problem :) I just want to extract text from a PDF file with
>Python. There are different libraries for that, but none of them works
>for me...
>
>I tried pyPdf and pdfTools; I don't know why, but they don't work with
>some PDFs...

Text can be represented in different ways in a PDF: as tagged text, as
bitmap or vector images, and even as algorithms (IIRC). Most tools can
only retrieve text that is stored as tagged text, so a tool may work on
some texts in some files and fail on others.

-- 
René Pijlman

What do you want to learn?  http://www.leren.nl


Re: Read and extract text from pdf

2006-04-21 Thread avishay
You can use Ghostscript for that purpose. Look at the ps2ascii script (or
batch file) in the Ghostscript distribution. You can either call
Ghostscript from the command line or use its DLL (I don't know whether a
Python binding already exists...). The limitations the previous poster
mentioned, however, still apply.

Avishay



Re: Read and extract text from pdf

2006-04-21 Thread Jim
There is a pdftotext executable, at least on Linux.
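
A minimal sketch of calling it from Python, assuming the pdftotext tool
from Xpdf/Poppler is installed and on the PATH. The helper names
pdftotext_command and extract_text are made up for this example; passing
"-" as the output file asks pdftotext to write the text to standard output.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build the pdftotext argument list (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page
        cmd.append("-layout")
    # "-" as the output file name sends the text to stdout
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(extract_text("tmp.pdf"))
```

As noted earlier in the thread, this only helps when the PDF actually
stores its text as text; scanned pages will come back empty.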
