Re: Looking Python script to compare two files
Thanks Tim! I will give it a try, maybe this weekend, and let you know the result.

--
http://mail.python.org/mailman/listinfo/python-list
RE: Looking Python script to compare two files
[david]
> So if I want to use these tools (antiword, pdf2text),
> can I pack these tools and the Python script into a
> Windows EXE file? I know there is an open source tool
> which can pack a Python script and its libs and generate
> a Windows EXE file.

I'm not especially qualified to answer this, but I think the
answer's Yes. I think that you can just tell py2exe that the
executables and DLLs of the other products are data files for
the Python one. Best look at the py2exe site and mailing list
for further info.

An alternative is just to use an installer to package the whole
thing in the usual Windows way.

> Yes, this approach can't handle the pictures in
> the PDF/Word file. Is there a way to work around
> it? Maybe it's very hard.

I'm not even sure how I'd go about it conceptually. How *do*
you compare two pictures? Do you really want to do this?

BTW, don't forget that if you're comparing Word with Word,
you can use its inbuilt comparison ability, which just needs
COM automation. (Don't know how that takes care of pictures
either, but if Word's own Compare can't, no-one else has
got a chance).

TJG

This e-mail has been scanned for all viruses by Star. The service is powered by MessageLabs. For more information on a proactive anti-virus service working around the clock, around the globe, visit: http://www.star.net.uk
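If the helper executables are shipped as data files next to the packaged program, the script still has to find them at run time. The following is only a sketch of one way to do that, in current Python 3. It assumes that the packaging tool sets `sys.frozen` (as far as I know py2exe does) and that the helpers end up in the same directory as the generated EXE. The function name `bundled_tool_path` and the file name `antiword.exe` are made up for illustration.

```python
import os
import sys


def bundled_tool_path(tool_name):
    """Return the expected path of a helper executable shipped beside the app.

    Assumption: helper tools are installed in the same directory as the
    frozen EXE (or, when running unfrozen, beside the script itself).
    """
    if getattr(sys, "frozen", False):
        # Frozen build: sys.executable is the generated EXE
        base_dir = os.path.dirname(os.path.abspath(sys.executable))
    else:
        # Plain script: fall back to the directory of the script being run
        base_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
    return os.path.join(base_dir, tool_name)


# Hypothetical helper name, used only to show the call
antiword_path = bundled_tool_path("antiword.exe")
print(antiword_path)
```

The returned path can then be passed to whatever code launches the external tool, so the same script works both frozen and unfrozen.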
Re: Looking Python script to compare two files
Thanks for the quick replies!

So if I want to use these tools (antiword, pdf2text), can I pack these tools and the Python script into a Windows EXE file? I know there is an open source tool which can pack a Python script and its libs and generate a Windows EXE file.

Yes, this approach can't handle the pictures in the PDF/Word file. Is there a way to work around it? Maybe it's very hard.

Regards
Re: Looking Python script to compare two files
Tim Golden wrote:

> + PDF: David Boddie's pdftools looks like about the only possibility:
> (ducks as a thousand people jump on him and point out the alternatives)

I might as well do that! Here are a couple of alternatives:

http://www.sourceforge.net/projects/pdfplayground
http://www.adaptive-enterprises.com.au/~d/software/pdffile/

Both of these are arguably more "Pythonic" than my solution, and the
first is also able to write out modified files. Cameron Laird also
maintains a page about PDF conversion tools:

http://phaseit.net/claird/comp.text.pdf/PDF_converters.html

> http://www.boddie.org.uk/david/Projects/Python/pdftools/
>
> Something like this might do the business. I'm afraid I've
> no idea how to determine where the line-breaks are. This
> was the first time I'd used pdftools, and the fact that
> I could do this much is a credit to its usability!

Thanks for the compliment! The read_text method in the PDFContents
class also lets you extract text from a given page in a document, but
you have to remember that text in PDF files isn't always composed as
a series of lines or paragraphs, and often doesn't even contain
whitespace characters.

David
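Because text extracted from a PDF has no reliable line structure, a line-by-line diff may report differences that are only layout. One possible way round this, sketched below in current Python 3, is to compare the two texts word by word with `difflib.SequenceMatcher`. The function name `word_level_changes` and the sample strings are invented for the example; this is not something pdftools itself provides.

```python
import difflib


def word_level_changes(old_text, new_text):
    """Compare two texts word by word, ignoring how whitespace was laid out."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only the stretches that differ: replace, delete or insert
        if tag != "equal":
            changes.append((tag, old_words[i1:i2], new_words[j1:j2]))
    return changes


# Same words apart from one, but with different line breaks and spacing
old = "The quick brown\nfox jumps"
new = "The quick  red fox\njumps"
print(word_level_changes(old, new))
```

If the extracted text contains no whitespace at all, splitting into words achieves nothing; in that case passing the raw strings to `SequenceMatcher` gives a character-level comparison instead, at the cost of noisier output.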
RE: Looking Python script to compare two files
[david]
> I want to compare PDF-PDF files and Word-Word files.

OK. Well, that's clear enough.

> It seems that the right way is:
> First, extract the text from the PDF or Word file.
> Then, use difflib to compare these text files.

When you say "it seems that the right way is..." I'll assume
that this way meets your requirements. It wouldn't be the right
way if, for example, you wanted to treat different header levels
as different, or to consider embedded graphics as significant etc.

> Would you please give me some more information
> about the external diff tools?

Well, I could mention the names of the ones which I might use
(WinMerge and GNU diff), but I'm sure there are many of them
around the place, and you're far better off doing this:

http://www.google.co.uk/search?q=diff+tools

In case you didn't realise, the "difflib" I referred to is a
Python module from the standard library:

Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>> `difflib`
""
>>>

> Are there some Python scripts that can extract text
> from a PDF or Word file?

Well, I'm sure there are, but my honest opinion is that, unless
you've got some compelling reason to do this in Python, you're
better off using, say:

+ antiword: http://www.winfield.demon.nl/
+ pdf2text from xpdf: http://www.foolabs.com/xpdf/home.html

If you really wanted to go with Python (for the learning experience,
if nothing else) then the most obvious candidates are:

+ Word: use the pywin32 modules to automate Word and save the
  document as text: http://pywin32.sf.net/

Something like this (assumes doc called c:\temp\test.doc exists):

  import win32com.client

  # Start Word via COM; EnsureDispatch also makes the wd* constants available
  word = win32com.client.gencache.EnsureDispatch ("Word.Application")
  doc = word.Documents.Open (FileName="c:/temp/test.doc")
  # Save a plain-text copy of the document
  doc.SaveAs (FileName="c:/temp/test2.txt", FileFormat=win32com.client.constants.wdFormatText)
  word.Quit ()
  del word

  text = open ("c:/temp/test2.txt").read ()
  print text

+ PDF: David Boddie's pdftools looks like about the only possibility:
  (ducks as a thousand people jump on him and point out the alternatives)

http://www.boddie.org.uk/david/Projects/Python/pdftools/

Something like this might do the business. I'm afraid I've
no idea how to determine where the line-breaks are. This
was the first time I'd used pdftools, and the fact that
I could do this much is a credit to its usability!

  from pdftools.pdffile import PDFDocument
  from pdftools.pdftext import Text

  def contents_to_text (contents):
    # Walk the (possibly nested) contents list, yielding the text of each Text item
    for item in contents:
      if isinstance (item, type ([])):
        for i in contents_to_text (item):
          yield i
      elif isinstance (item, Text):
        yield item.text

  doc = PDFDocument ("c:/temp/test.pdf")
  n_pages = doc.count_pages ()
  text = []

  # Pages are numbered from 1
  for n_page in range (1, n_pages+1):
    print "Page", n_page
    page = doc.read_page (n_page)
    contents = page.read_contents ().contents
    text.extend (contents_to_text (contents))

  print "".join (text)

TJG
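For the external-tool route, the extraction step can be driven from Python with the standard `subprocess` module. The sketch below is in current Python 3 and is only an illustration: `run_extractor` is a made-up name, and the commented invocations assume that antiword prints the document text to standard output and that the xpdf extractor is installed as `pdftotext` and writes to standard output when given `-` as the output file. Check each tool's own documentation before relying on either.

```python
import subprocess
import sys


def run_extractor(argv):
    """Run a command-line extractor and return what it wrote to stdout as text."""
    completed = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the tool reports failure
    )
    # The output encoding is an assumption; undecodable bytes are replaced
    return completed.stdout.decode("utf-8", errors="replace")


# Hypothetical invocations; they need the tools installed and on PATH:
#   word_text = run_extractor(["antiword", "c:/temp/test.doc"])
#   pdf_text = run_extractor(["pdftotext", "c:/temp/test.pdf", "-"])

# Self-contained demonstration using the Python interpreter as a stand-in tool
demo_text = run_extractor([sys.executable, "-c", "print('hello from a tool')"])
print(demo_text.strip())
```

Keeping the command line as a list avoids shell quoting problems with Windows paths that contain spaces.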
Re: Looking Python script to compare two files
Hello Tim:

One more thing: are there some Python scripts that can extract text from a PDF or Word file?

Thx
Re: Looking Python script to compare two files
Hello Tim:

Thanks for your reply!

I want to compare PDF-PDF files and Word-Word files.
It seems that the right way is:
First, extract the text from the PDF or Word file.
Then, use difflib to compare these text files.

Would you please give me some more information about the external diff tools?

Thx!
RE: Looking Python script to compare two files
[yys2000]
> I want to compare two PDF or Word files.

Could you be more precise, please?

+ Do you only want to compare PDF-PDF or Word-Word?
  Or do you want to be able to do PDF-Word?

+ In either case, are you only bothered about the
  text, or is the formatting significant?

+ If it's only text, then use whatever method you
  want to extract the text (antiword, ghostscript,
  COM automation, xpdf, etc.) and then use the difflib
  module, or some external diff tool.

+ If you want a structure/format comparison, you're
  into quite difficult territory, I believe. It's easy
  enough to convert a Word Doc to PDF if that were needed
  but PDFs are notoriously difficult to disentangle,
  altho' relatively straightforward to build. There's
  pdftools (http://www.boddie.org.uk/david/Projects/Python/pdftools/)
  which I can't say I've tried, but even once you've got the
  document object into Python, I don't imagine it'll be
  easy to compare.

+ To do Word-Word comparison, there's more hope on the
  horizon (if that's the metaphor I want). Word has built-in
  comparison functionality, and recent versions of TortoiseSVN,
  for example, include a script which will automate Word to do
  the right thing. Which is, essentially, open one doc, and call
  its .Compare method against the other.

TJG
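For the text-only case, once both documents have been reduced to plain text by whichever extractor, the comparison itself is a few lines with the standard library's `difflib`. This is a minimal sketch in current Python 3; the function name `text_diff`, the file labels and the sample strings are made up for the example.

```python
import difflib


def text_diff(old_text, new_text, old_name="old", new_name="new"):
    """Return a unified diff of two extracted texts as a list of lines."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile=old_name, tofile=new_name, lineterm=""
        )
    )


# Stand-ins for the text extracted from two versions of a document
old = "Chapter 1\nIt was a dark night.\nThe end."
new = "Chapter 1\nIt was a stormy night.\nThe end."
for line in text_diff(old, new, "test_v1.txt", "test_v2.txt"):
    print(line)
```

Lines starting with `-` appear only in the first text and lines starting with `+` only in the second; an empty result means the extracted texts are identical line for line.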
Looking Python script to compare two files
Hi:

I want to compare two PDF or Word files. Any help?

Thx