Hi,

I'm afraid there is no complete solution for your request. As a starting point, though, you could have a look at org.apache.pdfbox.util.PDFText2HTML. It is used when you call ExtractText with the "-html" parameter.

Best regards,
Andreas Lehmkühler

spud wrote:
> Hi guys,
>
> I've downloaded PDFBox and am very impressed with its functionality
> given that it's not even at version 1 yet!
>
> I wondered if someone could help me extract text from a PDF with
> formatting however. I've looked at the TextStripper class, and it does
> seem great for ripping plain text from the PDF, however I would like
> to have some basic formatting retained. I don't care about images or
> text placement, just relative text sizes and fonts. For example, using
> a commercial tool I managed to get:
>
> <TEXT>
> <p><font face="Swiss721SWA"
> style="font-size:10pt;font-style:Bold">BODY DIMENSIONS <SpaceCount
> space="65" />1I-1</font></p>
> <p><font face="Swiss721SWA" style="font-size:9pt">Three-dimensional
> <SpaceCount space="12" />Center-1o-center <SpaceCount space="14"
> /></font><font face="Swiss721SWA"
> style="font-size:16pt;font-style:Bold">GENERAL INFORMATION</font></p>
> </TEXT>
>
> This is good enough for me as I do not care about layout, and I can
> reason from the font size that General Information is probably a
> title, and Body Dimensions is a sub-title.
>
> My question to you guys is, how can I get this with PDFBox? I can see
> that you have some classes that seem to support fonts and size of
> text, but I can't see any examples that use them and I can't see any
> route to getting to them from the base PDDocument.
>
> Any help is greatly appreciated!
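
The post-processing described in the quoted message (emit a <font> element per run, then guess titles from relative font size) can be sketched independently of PDFBox. The Java below is only an illustration under assumptions: `FontRunFormatter`, `TextRun`, and the 1.5x size threshold are hypothetical names and values chosen for this sketch, not part of PDFBox, and how the font name, size, and weight of each run are obtained from the PDF is left open.

```java
import java.util.ArrayList;
import java.util.List;

class FontRunFormatter {

    /** Hypothetical holder for one stretch of text drawn in a single font. */
    static final class TextRun {
        final String fontName;
        final float sizePt;
        final boolean bold;
        final String text;

        TextRun(String fontName, float sizePt, boolean bold, String text) {
            this.fontName = fontName;
            this.sizePt = sizePt;
            this.bold = bold;
            this.text = text;
        }

        boolean sameStyle(TextRun other) {
            return fontName.equals(other.fontName)
                    && Float.compare(sizePt, other.sizePt) == 0
                    && bold == other.bold;
        }
    }

    /** Joins neighbouring runs that share font, size and weight. */
    static List<TextRun> mergeAdjacent(List<TextRun> runs) {
        List<TextRun> merged = new ArrayList<>();
        for (TextRun run : runs) {
            int last = merged.size() - 1;
            if (last >= 0 && merged.get(last).sameStyle(run)) {
                TextRun prev = merged.get(last);
                merged.set(last, new TextRun(prev.fontName, prev.sizePt,
                        prev.bold, prev.text + run.text));
            } else {
                merged.add(run);
            }
        }
        return merged;
    }

    /** Renders a run as a <font> element in the style of the quoted sample. */
    static String toMarkup(TextRun run) {
        // Print whole-number sizes without a trailing ".0".
        String size = run.sizePt == Math.rint(run.sizePt)
                ? String.valueOf((int) run.sizePt)
                : String.valueOf(run.sizePt);
        String style = "font-size:" + size + "pt"
                + (run.bold ? ";font-style:Bold" : "");
        return "<font face=\"" + run.fontName + "\" style=\"" + style + "\">"
                + escape(run.text) + "</font>";
    }

    /** Guesses a role from size relative to the body size (assumed thresholds). */
    static String classify(TextRun run, float bodySizePt) {
        float ratio = run.sizePt / bodySizePt;
        if (ratio >= 1.5f) return "title";
        if (ratio > 1.0f) return "subtitle";
        return "body";
    }

    /** Escapes the characters that would break the markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<TextRun> runs = new ArrayList<>();
        runs.add(new TextRun("Swiss721SWA", 10f, true, "BODY "));
        runs.add(new TextRun("Swiss721SWA", 10f, true, "DIMENSIONS"));
        runs.add(new TextRun("Swiss721SWA", 9f, false, "Three-dimensional"));
        runs.add(new TextRun("Swiss721SWA", 16f, true, "GENERAL INFORMATION"));
        for (TextRun run : mergeAdjacent(runs)) {
            System.out.println(classify(run, 9f) + ": " + toMarkup(run));
        }
    }
}
```

The body size (9pt here) is passed in by hand. In a real document it would probably have to be estimated, for example as the size that covers the most characters.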
