I am trying to extract some specific paragraphs from PDFs. I first tried converting the PDF to HTML, but that produced a lot of <p> tags that seemed to have no correlation to the actual paragraphs in my PDF.
Each paragraph has a header that is in a different font from the body text and is also bold. Is there any way to grab text based on the font used? I was thinking I could grab all the text between two lines that carry that specific font/weight information. Is this possible? Otherwise, can anyone recommend another way to grab specific paragraphs from a PDF? I have a few thousand PDFs, each with a paragraph under the header 'Summary'. I'd like to pull out the paragraphs associated with that header and display them on the web. Thanks! Jeremy
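
P.S. To make the idea concrete, here is a rough Python sketch of the logic I have in mind. It assumes some extraction step (which I don't have yet) can hand me each line of text together with its font name and a bold flag. The `Line` structure, the `extract_section` helper, and the font names "Arial-Bold" and "Times" are all made up for illustration; they are not taken from any real library or from my actual PDFs.

```python
from typing import Iterable, List, NamedTuple


class Line(NamedTuple):
    """Hypothetical record for one line of extracted PDF text."""
    text: str
    font: str   # font name as reported by the (assumed) extractor
    bold: bool  # True if the line is set in a bold weight


def is_header(line: Line, header_font: str) -> bool:
    # A header is any bold line set in the header font.
    return line.bold and line.font == header_font


def extract_section(lines: Iterable[Line], title: str, header_font: str) -> str:
    """Return the text between the header named `title` and the next header."""
    collected: List[str] = []
    inside = False
    for line in lines:
        if is_header(line, header_font):
            if inside:
                break  # the next header ends the section we wanted
            # Start collecting only if this header is the one we are after.
            inside = line.text.strip().lower() == title.lower()
            continue
        if inside:
            collected.append(line.text.strip())
    return " ".join(collected)


# Made-up sample data standing in for real extractor output.
sample_lines = [
    Line("Background", "Arial-Bold", True),
    Line("Some intro text.", "Times", False),
    Line("Summary", "Arial-Bold", True),
    Line("First summary line", "Times", False),
    Line("second summary line.", "Times", False),
    Line("Details", "Arial-Bold", True),
    Line("Other text.", "Times", False),
]

summary = extract_section(sample_lines, "Summary", "Arial-Bold")
```

With the sample data above, `summary` should come out as "First summary line second summary line.". What I am missing is the first step: a tool that gives me the font and weight for each line of a real PDF.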

