Hi,

I'm afraid there isn't a complete solution for what you are asking. As
a starting point, though, you could have a look at

org.apache.pdfbox.util.PDFText2HTML

It is used when you run ExtractText with the "-html" parameter.
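
If that output is not close enough to what you need, the remaining step
could look roughly like the sketch below. It assumes you have already
collected each piece of text together with its font name and size, for
example from the TextPosition objects that PDFTextStripper works with.
The TextRun class and all method names are made up for illustration and
are not part of PDFBox. The sketch only shows how neighbouring pieces
with the same font and size could be merged and written out as <font>
tags similar to your example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FontRunsToHtml {

    /** Hypothetical holder (not a PDFBox class): text plus the font it was drawn in. */
    static final class TextRun {
        final String text;
        final String fontName;
        final float sizePt;

        TextRun(String text, String fontName, float sizePt) {
            this.text = text;
            this.fontName = fontName;
            this.sizePt = sizePt;
        }

        boolean sameStyle(TextRun other) {
            return fontName.equals(other.fontName) && sizePt == other.sizePt;
        }
    }

    /** Merges neighbouring runs that share the same font name and size. */
    static List<TextRun> merge(List<TextRun> runs) {
        List<TextRun> merged = new ArrayList<TextRun>();
        for (TextRun run : runs) {
            int last = merged.size() - 1;
            if (last >= 0 && merged.get(last).sameStyle(run)) {
                TextRun prev = merged.get(last);
                merged.set(last, new TextRun(prev.text + run.text, prev.fontName, prev.sizePt));
            } else {
                merged.add(run);
            }
        }
        return merged;
    }

    /** Escapes the characters that would otherwise break the markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Prints whole sizes as "9" and fractional sizes as "9.5". */
    static String formatSize(float size) {
        if (size == Math.rint(size)) {
            return String.valueOf((long) size);
        }
        return String.valueOf(size);
    }

    /** Writes one paragraph with one <font> element per merged run. */
    static String toHtml(List<TextRun> runs) {
        StringBuilder out = new StringBuilder("<p>");
        for (TextRun run : merge(runs)) {
            out.append("<font face=\"").append(escape(run.fontName))
               .append("\" style=\"font-size:").append(formatSize(run.sizePt))
               .append("pt\">").append(escape(run.text)).append("</font>");
        }
        return out.append("</p>").toString();
    }

    public static void main(String[] args) {
        // Sample data typed in by hand; in practice it would come from the PDF.
        List<TextRun> runs = Arrays.asList(
                new TextRun("Three-dimensional ", "Swiss721SWA", 9f),
                new TextRun("drawings ", "Swiss721SWA", 9f),
                new TextRun("GENERAL INFORMATION", "Swiss721SWA", 16f));
        System.out.println(toHtml(runs));
    }
}
```

Merging before writing keeps the output small, and comparing the sizes
of the merged runs is one simple way to guess which lines are titles.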

BR
Andreas Lehmkühler

spud wrote:
> Hi guys,
> 
> I've downloaded PDFBox and am very impressed with its functionality
> given that it's not even at version 1 yet!
> 
> I wondered if someone could help me extract text from a PDF with
> formatting however. I've looked at the TextStripper class, and it does
> seem great for ripping plain text from the PDF, however I would like
> to have some basic formatting retained. I don't care about images or
> text placement, just relative text sizes and fonts. For example, using
> a commercial tool I managed to get:
> 
> <TEXT>
> <p><font face="Swiss721SWA"
> style="font-size:10pt;font-style:Bold">BODY DIMENSIONS <SpaceCount
> space="65" />1I-1</font></p>
> <p><font face="Swiss721SWA" style="font-size:9pt">Three-dimensional
> <SpaceCount space="12" />Center-1o-center <SpaceCount space="14"
> /></font><font face="Swiss721SWA"
> style="font-size:16pt;font-style:Bold">GENERAL INFORMATION</font></p>
> </TEXT>
> 
> This is good enough for me as I do not care about layout, and I can
> reason from the font size that General Information is probably a
> title, and Body Dimensions is a sub-title.
> 
> My question to you guys is, how can I get this with PDFBox? I can see
> that you have some classes that seem to support fonts and size of
> text, but I can't see any examples that use them, and I can't see any
> route to getting to them from the base PDDocument.
> 
> Any help is greatly appreciated!
