Hi guys, I've downloaded PDFBox and am very impressed with its functionality given that it's not even at version 1 yet!
I wondered if someone could help me extract text from a PDF with some of its formatting intact. I've looked at the PDFTextStripper class, and it seems great for ripping plain text out of a PDF, but I would like to retain some basic formatting. I don't care about images or text placement, just the fonts and relative text sizes. For example, using a commercial tool I managed to get:

<TEXT>
<p><font face="Swiss721SWA" style="font-size:10pt;font-style:Bold">BODY DIMENSIONS <SpaceCount space="65" />1I-1</font></p>
<p><font face="Swiss721SWA" style="font-size:9pt">Three-dimensional <SpaceCount space="12" />Center-1o-center <SpaceCount space="14" /></font><font face="Swiss721SWA" style="font-size:16pt;font-style:Bold">GENERAL INFORMATION</font></p>
</TEXT>

This is good enough for me, as I do not care about layout, and I can infer from the font size that General Information is probably a title and Body Dimensions a sub-title.

My question to you guys is: how can I get this with PDFBox? I can see that you have some classes that seem to support fonts and text sizes, but I can't find any examples that use them, and I can't see any route to them from the base PDDocument.

Any help is greatly appreciated!
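
P.S. In case it helps to see what I'd do with the font sizes once I can get at them, here is a rough sketch of the title/sub-title guess I described. Nothing in it is PDFBox API: TextRun, HeadingGuess and classify are names I made up for illustration, and the rule itself (largest size is a title, second largest is a sub-title when there are at least three distinct sizes, everything else is body text) is only my assumption about what would work for my documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class HeadingGuess {

    /** One run of text in a single font and size (hypothetical type, not a PDFBox class). */
    static final class TextRun {
        final String text;
        final String fontName;
        final float sizePt;

        TextRun(String text, String fontName, float sizePt) {
            this.text = text;
            this.fontName = fontName;
            this.sizePt = sizePt;
        }
    }

    /**
     * Labels each run by ranking the distinct font sizes that occur:
     * the largest size is "title" (needs at least two distinct sizes),
     * the second largest is "subtitle" (needs at least three),
     * and everything else is "body".
     */
    static List<String> classify(List<TextRun> runs) {
        TreeSet<Float> sizes = new TreeSet<>();
        for (TextRun run : runs) {
            sizes.add(run.sizePt);
        }

        // Work out which sizes count as title and sub-title, if any.
        Float titleSize = sizes.size() >= 2 ? sizes.last() : null;
        Float subtitleSize = sizes.size() >= 3 ? sizes.lower(sizes.last()) : null;

        List<String> labels = new ArrayList<>();
        for (TextRun run : runs) {
            if (titleSize != null && run.sizePt == titleSize) {
                labels.add("title");
            } else if (subtitleSize != null && run.sizePt == subtitleSize) {
                labels.add("subtitle");
            } else {
                labels.add("body");
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // The three runs from the commercial tool's output above.
        List<TextRun> runs = Arrays.asList(
                new TextRun("BODY DIMENSIONS", "Swiss721SWA", 10f),
                new TextRun("Three-dimensional", "Swiss721SWA", 9f),
                new TextRun("GENERAL INFORMATION", "Swiss721SWA", 16f));

        List<String> labels = classify(runs);
        for (int i = 0; i < runs.size(); i++) {
            System.out.println(labels.get(i) + ": " + runs.get(i).text);
        }
    }
}
```

With the sample above this labels BODY DIMENSIONS as a sub-title, Three-dimensional as body text and GENERAL INFORMATION as a title, which is all I need. The missing piece is how to get the font name and size for each piece of text out of PDFBox in the first place.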
