Hi guys,

I've downloaded PDFBox and am very impressed with its functionality
given that it's not even at version 1 yet!

I wondered if someone could help me extract text from a PDF while
keeping some of its formatting. I've looked at the PDFTextStripper
class, and it seems great for ripping plain text from a PDF, but I
would like to retain some basic formatting. I don't care about images
or text placement, just relative text sizes and fonts. For example,
using a commercial tool I managed to get:

<TEXT>
<p><font face="Swiss721SWA"
style="font-size:10pt;font-style:Bold">BODY DIMENSIONS <SpaceCount
space="65" />1I-1</font></p>
<p><font face="Swiss721SWA" style="font-size:9pt">Three-dimensional
<SpaceCount space="12" />Center-1o-center <SpaceCount space="14"
/></font><font face="Swiss721SWA"
style="font-size:16pt;font-style:Bold">GENERAL INFORMATION</font></p>
</TEXT>

This is good enough for me as I do not care about layout, and I can
reason from the font size that General Information is probably a
title, and Body Dimensions is a sub-title.
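
To make that reasoning concrete, here is a rough sketch of the
post-processing I have in mind. It is plain Java and does not use
PDFBox at all; the class name FontSizeRoles and the method classify
are names I made up, and the regex assumes runs shaped like the
<font ... font-size:NNpt ...> output above. It ranks the distinct font
sizes and calls the largest a title, the next a sub-title, and
everything else body text, which is a crude heuristic rather than
anything robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: guesses a role for each <font> run from its size.
class FontSizeRoles {
    // One <font ... font-size:NNpt ...>text</font> run, as in the sample output.
    private static final Pattern RUN = Pattern.compile(
        "<font[^>]*font-size:(\\d+(?:\\.\\d+)?)pt[^>]*>(.*?)</font>",
        Pattern.DOTALL);

    static List<String> classify(String xml) {
        List<Double> sizes = new ArrayList<>();
        List<String> texts = new ArrayList<>();
        Matcher m = RUN.matcher(xml);
        while (m.find()) {
            sizes.add(Double.parseDouble(m.group(1)));
            // Drop nested tags such as <SpaceCount .../> and collapse whitespace.
            texts.add(m.group(2).replaceAll("<[^>]+>", " ")
                                .replaceAll("\\s+", " ").trim());
        }
        // Distinct sizes, largest first.
        List<Double> ranked = new ArrayList<>(new TreeSet<>(sizes).descendingSet());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < sizes.size(); i++) {
            int rank = ranked.indexOf(sizes.get(i));
            String role = rank == 0 ? "title" : rank == 1 ? "sub-title" : "body";
            out.add(role + ": " + texts.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        String sample =
            "<p><font face=\"Swiss721SWA\" style=\"font-size:10pt;font-style:Bold\">"
          + "BODY DIMENSIONS <SpaceCount space=\"65\" />1I-1</font></p>"
          + "<p><font face=\"Swiss721SWA\" style=\"font-size:9pt\">Three-dimensional "
          + "<SpaceCount space=\"12\" />Center-1o-center <SpaceCount space=\"14\" />"
          + "</font><font face=\"Swiss721SWA\" "
          + "style=\"font-size:16pt;font-style:Bold\">GENERAL INFORMATION</font></p>";
        for (String line : classify(sample)) {
            System.out.println(line);
        }
    }
}
```

On the sample above this labels GENERAL INFORMATION as the title and
the BODY DIMENSIONS run as a sub-title, which matches what I would
guess by eye. So all I really need from PDFBox is the font name and
size for each run of text.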

My question to you guys is: how can I get this with PDFBox? I can see
that you have some classes that seem to support fonts and text sizes,
but I can't see any examples that use them, and I can't see any route
to getting to them from the base PDDocument.

Any help is greatly appreciated!
