MB, having the text would be way more useful than the PDF pages! Thanks for recommending pdftotext and the -layout option.

I have some questions -- could you help me break this process down into smaller steps?

I looked up pdfjam's split command online -- I think it may be a little time-consuming (my PDFs are a few thousand pages long):

http://0x2a.at/blog/2011/02/pdf_manipulation_on_the_cli/

http://tex.stackexchange.com/questions/79623/quickly-extracting-individual-pages-from-a-document

I looked at PDF Shuffler (the GUI one) and that can only split files one-by-one. Are there other options?


Once I split the files into single pages, I'll need the shell loop 'for file in pages/*'. I don't understand what this step will do. Could you please explain this step too?

About this step: 'if pdftotext "$file" - | grep -i regexps' -- does this copy all the PDF text to one text file? And then search (grep) the text file? Does this command take text from many single PDFs? Or only after the "hit" pages are joined up into one document?

What does it mean to "append the file to a shell variable"? What is the goal in this step? Could you please explain how I can do this step too?
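
For reference, the steps asked about above might fit together roughly as in the sketch below. It is only an illustration under assumptions, not necessarily what MB had in mind: the function name find_hits and the variable names are made up here, pages/ is assumed to hold one PDF per page, and the pattern stands in for the real regexps. It also assumes file names without spaces, since the list is built as a single space-separated string.

```shell
#!/bin/sh
# Sketch only: find_hits, dir, pattern and hits are hypothetical names.
# Prints the single-page PDFs in a directory whose text matches a pattern.
find_hits() {
    dir=$1
    pattern=$2
    hits=""    # shell variable that accumulates the matching file names

    # The loop body runs once for every file in the directory.
    for file in "$dir"/*; do
        [ -f "$file" ] || continue    # skip if the directory is empty

        # "-" tells pdftotext to write the text to standard output instead of
        # a .txt file, so the text of this one page goes straight into grep.
        # grep -q prints nothing and only reports whether there was a match;
        # -i ignores upper/lower case.
        if pdftotext -layout "$file" - | grep -qi -e "$pattern"; then
            # "Appending to a shell variable": add this file name to the list.
            hits="${hits:+$hits }$file"
        fi
    done

    printf '%s\n' "$hits"
}
```

Called as find_hits pages 'regexps', this would print the names of the "hit" pages on one line. Each page is converted and searched separately, no intermediate text file is written, and the resulting list is what a later joining step would receive.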
