Re: [Tutor] PDF to text conversion
On Tue, 21 Apr 2009 18:37:39 -0400, Robert Berman berma...@cfl.rr.com wrote: snip I did use Adobe Reader to convert the history PDF file into a text file and it did seem to do it faithfully.

The best converter so far is pdftotext from http://www.glyphandcog.com/, who maintain an open source project at http://www.foolabs.com/xpdf/. It's not a Python library, but you can call pdftotext from within Python using os.system(). I used the pdftotext -layout option and that gave the best result.

hth.
dinesh

___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] PDF to text conversion
Dinesh,

I have pdftotext version 3.0.0. I have decided to use this to go from PDF to text. It is not the ideal solution, but it is certainly a doable one.

Thank you,
Robert
Re: [Tutor] PDF to text conversion
Dinesh B Vadhia wrote: snip It's not a Python library but you can call pdftotext from within Python using os.system(). I used the pdftotext -layout option and that gave the best result.

You can use subprocess:

    #!/usr/bin/python
    from subprocess import call
    # With no output name given, pdftotext writes test.txt next to test.pdf.
    call(['pdftotext', 'test.pdf'])

-david
-- Powered by Gentoo GNU/Linux http://linuxcrazy.com
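Combining the subprocess call above with the -layout option Dinesh mentioned, a minimal sketch might look like this. The helper names and file names are made up for illustration, and it assumes Python 3 and that pdftotext (from xpdf or poppler-utils) is on the PATH.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path, layout=True):
    """Return the argument list for a pdftotext call (hypothetical helper)."""
    command = ["pdftotext"]
    if layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        command.append("-layout")
    command.extend([pdf_path, txt_path])
    return command


def convert_pdf(pdf_path, txt_path, layout=True):
    """Run pdftotext; raise if it is missing or exits with an error."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # check_call raises CalledProcessError on a non-zero exit status.
    subprocess.check_call(build_pdftotext_command(pdf_path, txt_path, layout))
```

A call such as convert_pdf("history.pdf", "history.txt") would then leave the text in history.txt. Passing the arguments as a list avoids the shell quoting problems that os.system() can run into with file names containing spaces.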
[Tutor] PDF to text conversion
Hi,

I must convert a history file in PDF format that goes from May of 1988 to the current date. Readings are taken twice weekly and consist of the date taken (mm/dd/yy) and the result, which appears as a 10-character sequence of digits and special characters. This is obviously an easy setup for a very small database application with the date as the key and the result string as the data.

My problem is converting the PDF file into a text file which I can then read and process. I do not see any free Python libraries having this capacity. I did see a PDFPILOT program for Windows, but this application is being developed on Linux and should also run on Windows, so I do not want to incorporate a Windows-only application.

I do not think I am breaking any new frontiers with this application. Have any of you worked with such a library, or do you know of one or two I can download and work with? Hopefully, they have reasonable documentation.

My development environment is: Python, Linux (Ubuntu version 8.10).

Thanks for any help you might be able to offer.

Robert Berman
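Once the PDF is available as text, the readings described above could be picked out with a regular expression. This is only a sketch: the real layout of the converted file is not shown in this thread, so the line shape assumed here (date, whitespace, then exactly 10 non-space characters) and the name parse_readings are assumptions.

```python
import re
from datetime import datetime

# Assumed line shape: "05/17/88  12345-678*" (date, whitespace, 10 non-space chars).
READING = re.compile(r"^\s*(\d{2}/\d{2}/\d{2})\s+(\S{10})\s*$")


def parse_readings(lines):
    """Yield (ISO date string, result string) for each line that matches."""
    for line in lines:
        match = READING.match(line)
        if match is None:
            continue  # skip headers, blank lines and anything else
        date_text, result = match.groups()
        # %y maps 69-99 to 1969-1999 and 00-68 to 2000-2068;
        # strptime raises ValueError for impossible dates such as 13/45/99.
        taken = datetime.strptime(date_text, "%m/%d/%y").date()
        yield taken.isoformat(), result
```

Storing the date in ISO form (yyyy-mm-dd) keeps the key sortable as plain text, which is convenient if it becomes the database key.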
Re: [Tutor] PDF to text conversion
Robert Berman wrote: snip Have any of you worked with such a library, or do you know of one or two I can download and work with? Hopefully, they have reasonable documentation.

If this is a one-time conversion, just use the save-as-text feature of Adobe Reader.

-- Bob Gailer Chapel Hill NC 919-636-4239
Re: [Tutor] PDF to text conversion
On Tue, Apr 21, 2009 at 12:54 PM, bob gailer bgai...@gmail.com wrote: snip If this is a one-time conversion just use the save as text feature of adobe reader.

I tried pyPdf once, just for fun, and it was nice: http://pybrary.net/pyPdf/

-- لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.محمد الغزالي No victim has ever been more repressed and alienated than the truth Emad Soliman Nawfal Indiana University, Bloomington
Re: [Tutor] PDF to text conversion
Bob,

Thank you for the quick reply. I am acquainted with that method, and it will certainly work for some really serious testing. However, the data collection is an ongoing process, and the users are requesting that every month the latest entries (8) be brought into the system. What is rather irksome is that the output from the system cannot be changed from PDF to text, so obviously I am going to have to resolve the situation at my end.

I am envisioning a simple program that, once started, reads the data file, converts the data into text, and then sends the data to the database. The program doesn't care if there are 8 test results or 80,000 test results. That is why I am looking for a Python module.

Thanks again,
Robert Berman
Re: [Tutor] PDF to text conversion
Hello Emad,

I have seriously looked at the documentation associated with pyPdf. It seems to have the page as its smallest element of work, and what I need is a line-by-line process to go from PDF format to text. I don't think pyPdf will meet my needs, but thank you for bringing it to my attention.

Thanks,
Robert Berman
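A page-level extractor does not necessarily rule out line-by-line processing, provided the text it returns for each page still contains line breaks. That is not guaranteed and depends on both the PDF and the library. The sketch below deliberately does not call pyPdf (its API is not shown in this thread); it takes already-extracted page strings from any source, and the name iter_lines is hypothetical.

```python
def iter_lines(page_texts):
    """Yield (page_number, line) for every non-blank line of each page's text."""
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            if line.strip():
                yield page_number, line
```

If the extracted page text comes back as one long run with no line breaks, this approach will not help and a layout-preserving converter is the better route.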
Re: [Tutor] PDF to text conversion
Robert Berman wrote: snip Have any of you worked with such a library, or do you know of one or two I can download and work with? Hopefully, they have reasonable documentation. My development environment is: Python Linux Ubuntu version 8.10

I've used

    [r...@fcfw2 /]# /usr/bin/pdftotext -v
    pdftotext version 2.01
    Copyright 1996-2002 Glyph & Cog, LLC
    [r...@fcfw2 /]# cat /etc/issue
    Red Hat Linux release 9 (Shrike)

HTH,
Emile
Re: [Tutor] PDF to text conversion
On Tuesday 21 April 2009 10:36:59 am Robert Berman wrote: snip I am envisioning a simple program that once started reads the data file, converts the data into text, and then sends the data to the database. That is why i am looking for a python module.

On Linux, pdftotext is available, and you might want to check out Ghostscript, which runs on Windows and Linux.

-- John Fabiani
Re: [Tutor] PDF to text conversion
On Tue, 21 Apr 2009 13:44:16 -0400, Robert Berman berma...@cfl.rr.com wrote: snip I don't think pyPDF will meet my needs but thank you for bringing it to my attention.

Hi Robert,

I don't have an answer, but you can have my sympathy. I've been looking for a quality PDF-to-text converter for months and have not turned up anything useful. I've tried many free programs, which are poor. I too wanted a Python-only solution and tried pyPdf, but that didn't work. Just today I downloaded a trial version of a so-called top-notch converter and it produced unfaithful text. Not sure what the answer is!

Dinesh
Re: [Tutor] PDF to text conversion
On Tuesday 21 April 2009 11:05:32 am Dinesh B Vadhia wrote: snip I've been looking for a quality pdf to text convertor for months and not turned up anything useful.

Have you tried going from PDF to PS and then to text?

-- John Fabiani
Re: [Tutor] PDF to text conversion
Emile van Sebille wrote: snip I've used pdftotext version 2.01 Copyright 1996-2002 Glyph & Cog, LLC

Hi Robert,

pdftotext is part of poppler-utils, an Ubuntu package which can be installed like so:

    sudo aptitude install poppler-utils

But I too would be interested in finding a Python library/module for this.

Regards,
Dayo
Re: [Tutor] PDF to text conversion
Robert Berman wrote: snip I don't think pyPDF will meet my needs but thank you for bringing it to my attention.

Have you looked at pdfminer? http://www.unixuser.org/~euske/python/pdfminer/index.html Looks promising.

HTH,
Marty
Re: [Tutor] PDF to text conversion
Robert Berman wrote: snip My problem is converting the PDF file into a text file which I can then read and process. I do not see any free python libraries having this capacity.

How about pyPdf: http://pybrary.net/pyPdf/

And an example: http://code.activestate.com/recipes/511465/

-david
Re: [Tutor] PDF to text conversion
On Tue, Apr 21, 2009 at 8:44 PM, Dayo Adewunmi contactd...@gmail.com wrote: snip But I to would be interested in finding a python library/module for this.

The itools library from hforge.org has a PDF2TEXT implementation, itools.pdf: http://www.hforge.org/itools

norman
Re: [Tutor] PDF to text conversion
First, thanks to everyone who contributed to this thread. I have a number of possible solutions and a number of paths to pursue to determine which avenue I should take to resolve this remaining issue.

I did try the itools library, and while everything installed nicely, most of the tests failed, so I am not particularly overjoyed with the results.

Thank you, Dinesh, for the vote of sympathy. I do appreciate it.

I did use Adobe Reader to convert the history PDF file into a text file, and it did seem to do it faithfully. So now I will work out a parsing function to extract my data and send it to an SQLite database.

I am thrilled both with the number of suggestions I have received from this group and with their quality.

Thanks again,
Robert Berman
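The database step mentioned above could be sketched with Python's standard sqlite3 module. The table name, column names and the function store_readings are assumptions, not anything stated in this thread. With the date as primary key and INSERT OR IGNORE, the monthly import can be rerun over the whole history file without creating duplicates.

```python
import sqlite3


def store_readings(db_path, readings):
    """Insert (date, result) pairs; return how many rows were new."""
    connection = sqlite3.connect(db_path)
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "CREATE TABLE IF NOT EXISTS readings ("
                "taken TEXT PRIMARY KEY, result TEXT NOT NULL)"
            )
            before = connection.total_changes
            # OR IGNORE skips dates that are already stored.
            connection.executemany(
                "INSERT OR IGNORE INTO readings (taken, result) VALUES (?, ?)",
                readings,
            )
            return connection.total_changes - before
    finally:
        connection.close()
```

Because already-stored dates are skipped rather than rejected, it makes no difference whether a run sees 8 new readings or the full history.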