Re: Python, Perl & PDF files

Cameron Laird Wed, 27 Apr 2005 16:10:06 -0700

In article <[EMAIL PROTECTED]>,
rbt  <[EMAIL PROTECTED]> wrote:
>Cameron Laird wrote:
>> In article <[EMAIL PROTECTED]>,
>> rbt  <[EMAIL PROTECTED]> wrote:
>>                      .
>>                      .
>>                      .
>> 
>>>Read and search them for strings. If I could do that on windows, linux 
>>>and mac with the *same* bit of Python code, I'd be very happy ;)
>> 
>> 
>> Textual content, right?  Without regard to font funniness, or
>> whether the string is in or out of a table, and so on?
>
>That's right. More specifically, I've written a script that uses a RE to
>search through documents for social security numbers. You can see it here:
>
>http://filebox.vt.edu/users/rtilley/public/find_ssns/find_ssns.html
>
>This works on Word, Excel, html, rtf or any ANSI based text. I need the
>ability to read and make sense of PDF files as well so I can apply the RE
>to their content. It's been frustrating to say the least. Nothing at all
>against Python... mostly just sick of hearing about the 'Portable'
>document format that isn't string or RE searchable... at least not easily
>anyway.
                        .
                        .
                        .
PDF is NOT easy to search.  'Fact, many times it's not even feasible,
in any automated sense.
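
One reason a plain RE pass over the raw file comes up empty: PDF content
streams are commonly stored Flate-compressed (zlib/deflate), so the bytes
on disk do not contain the literal text. Here is a minimal sketch of that
one obstacle, using only the standard library. The pattern SSN_RE, the
helper find_ssns() and the sample record are all made up for illustration;
this is not the RE from the script linked above, and it does not parse a
real PDF.

```python
import re
import zlib

# Hypothetical SSN-like pattern (ddd-dd-dddd), for illustration only.
SSN_RE = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")


def find_ssns(data: bytes) -> list:
    """Return every SSN-like byte string found in data."""
    return SSN_RE.findall(data)


# Made-up sample text standing in for the text of one page.
plain = b"Employee record for J. Doe, SSN 123-45-6789, kept on file."

# Roughly what a Flate-compressed stream body looks like on disk.
stream = zlib.compress(plain)

hits_plain = find_ssns(plain)                      # text as written
hits_raw = find_ssns(stream)                       # compressed bytes
hits_inflated = find_ssns(zlib.decompress(stream)) # after inflating

print(hits_plain, hits_raw, hits_inflated)
```

Even after inflating, a real content stream holds drawing operators and
font-encoded strings rather than running text, so a match is still not
guaranteed. That part is not shown here.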


When I can make time, I want to look into your Word and Excel searching;
there are several tricks for doing these in full generality.

Unless I've missed late-breaking news, Perl does NOT help, despite the
flashy appearance of the CPAN search page you referenced.  None of that
stuff gets at content in a sense that'll serve you well.

Neither does anything open-sourced in Python.  The best I know is what
I'm slowly documenting at <URL:
http://phaseit.net/claird/comp.text.pdf/PDF_converters.html#pdf2txt >,
as David mentioned earlier.
-- 
http://mail.python.org/mailman/listinfo/python-list
