In article <[EMAIL PROTECTED]>, rbt <[EMAIL PROTECTED]> wrote:
>Cameron Laird wrote:
>> In article <[EMAIL PROTECTED]>,
>> rbt <[EMAIL PROTECTED]> wrote:
>>                      .
>>                      .
>>                      .
>>>Read and search them for strings. If I could do that on windows, linux
>>>and mac with the *same* bit of Python code, I'd be very happy ;)
>>
>> Textual content, right?  Without regard to font funniness, or
>> whether the string is in or out of a table, and so on?
>
>That's right. More specifically, I've written a script that uses a RE to
>search through documents for social security numbers. You can see it here:
>
>http://filebox.vt.edu/users/rtilley/public/find_ssns/find_ssns.html
>
>This works on Word, Excel, html, rtf or any ANSI based text. I need the
>ability to read and make sense of PDF files as well so I can apply the RE
>to their content. It's been frustrating to say the least. Nothing at all
>against Python... mostly just sick of hearing about the 'Portable'
>document format that isn't string or RE searchable... at least not easily
>anyway.
                        .
                        .
                        .
PDF is NOT easy to search.  'Fact, many times it's not even
feasible, in any automated sense.
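
For what it's worth, the RE pass itself is the easy part once plain text
is in hand, whatever tool produced it. A minimal sketch, assuming the
text has already been extracted into a Python string; the pattern and
the name find_ssns below are illustrative guesses, not taken from the
script linked above:

```python
import re

# Hypothetical pattern: 3 digits, 2 digits, 4 digits, hyphen-separated.
# The RE in the find_ssns script referenced above may well differ.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssns(text):
    """Return every SSN-shaped substring in already-extracted text."""
    return SSN_RE.findall(text)


if __name__ == "__main__":
    sample = "Employee 17: 123-45-6789, phone 555-1234, backup id 987-65-4321."
    print(find_ssns(sample))
```

The same function runs unchanged on Windows, Linux, and Mac. What it
cannot do is recover text that the extraction step failed to deliver in
reading order, which is exactly where PDF causes trouble.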

When I can make time, I want to look into your Word and Excel
searching; there are several tricks for doing these in full
generality.

Unless I've missed late-breaking news, Perl does NOT help, despite
the flashy appearance of the CPAN search page you referenced.  None
of that stuff gets at PDF content in a way that'll serve you well.
Neither does anything open source in Python.  The best I know of is
what I'm slowly documenting at
<URL: http://phaseit.net/claird/comp.text.pdf/PDF_converters.html#pdf2txt >,
as David mentioned earlier.