Hello André,

Have a look at PDFTextStripper. It collects the tokens of a given document as so-called TextPositions. A TextPosition object has a method called getFont, which returns the font object encapsulating the font information for the current token. You can retrieve the base font name from that font object (the PostScript name of the font) and check whether it ends with a suffix such as -bold (this is at least what I did to detect bold text blocks).

A TextPosition object also carries the attribute fontSize. With it you should be able to detect larger text tokens by (just a suggestion) parsing an entire page, computing the median font size, then parsing the page again and checking whether the fontSize of each token is above the median.
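To make the two checks concrete, here is a minimal sketch in plain Java. It deliberately leaves PDFBox out: it assumes you have already collected the font name and font size of each TextPosition into ordinary strings and floats. The class FontHeuristics and its methods are hypothetical names for illustration, not part of PDFBox, and matching "bold" anywhere in the name (rather than only as a suffix) is my own loosening so that names like "Arial,BoldItalic" are caught too.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: works on values you have
// already read from each TextPosition (font name and font size).
final class FontHeuristics {
    private FontHeuristics() {}

    // Heuristic: the PostScript font name mentions "bold" in any case,
    // e.g. "Helvetica-Bold" or "Arial,BoldItalic". Fonts whose names do
    // not follow this convention will be missed.
    static boolean looksBold(String fontName) {
        return fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    // Median of the font sizes seen on one page (first pass).
    static float median(float[] sizes) {
        if (sizes.length == 0) {
            throw new IllegalArgumentException("no font sizes collected");
        }
        float[] sorted = sizes.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2f;
    }

    // Second pass: a token counts as "large" if it is strictly above the median.
    static boolean isAboveMedian(float fontSize, float pageMedian) {
        return fontSize > pageMedian;
    }
}
```

In use, you would first gather all font sizes of a page into an array, call FontHeuristics.median on it once, and then test every token with isAboveMedian and looksBold. The median is preferable to the mean here because a few very large headings do not shift it.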

I hope this helps.

With kind regards,
Robert



André Ramos wrote:
Hello,

I'd like to use PDFBox to extract text with special features like: bold
text, italicized text, text whose font size is above average and so on. The
idea is that any kind of highlighted text or any text formatted out of the
ordinary within a document must contain relevant terms to describe the
document.

How can I do it?

Thank you.

