Re: Extracting Features From Text

Robert Pesch Wed, 08 Jul 2009 01:55:40 -0700

Hello André,

Have a look at the PDFTextStripper. It collects tokens (so-called TextPositions) from a given document. A TextPosition object has a method called getFont, which returns the font object encapsulating the font information for the current token. What you can do is retrieve the base font name from the font object (the PostScript name of the font) and check whether it ends with the suffix -bold or similar (this is at least what I did to detect bold text blocks). Furthermore, a TextPosition object contains the attribute fontSize. With this attribute you should be able to detect larger text tokens by (just a suggestion) parsing an entire page, computing the median font size, then parsing the page again and checking whether the fontSize of a token is above the median.
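
To make the idea a bit more concrete, here is a rough sketch of the two checks (bold suffix and above-median font size) in plain Java. It deliberately does not call PDFBox: the Token class is a hypothetical stand-in for the base font name and fontSize you would read from each TextPosition, and the names FontFeatures, looksBold, median and highlighted are made up for this example. The "-bold" suffix test is only a heuristic, since font naming is not standardized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class FontFeatures {

    // Hypothetical stand-in for the values read from a TextPosition
    // (token text, base font name, font size). Not a PDFBox class.
    static final class Token {
        final String text;
        final String baseFontName;
        final float fontSize;

        Token(String text, String baseFontName, float fontSize) {
            this.text = text;
            this.baseFontName = baseFontName;
            this.fontSize = fontSize;
        }
    }

    // Heuristic: treat a font as bold if its name ends with "-bold"
    // (case-insensitive). Real font names vary, so this is an assumption.
    static boolean looksBold(String baseFontName) {
        return baseFontName != null
                && baseFontName.toLowerCase(Locale.ROOT).endsWith("-bold");
    }

    // Median of the font sizes; sorts a copy so the input is not modified.
    static float median(float[] sizes) {
        if (sizes.length == 0) {
            throw new IllegalArgumentException("no font sizes");
        }
        float[] sorted = sizes.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2f;
    }

    // First pass: compute the median size of the page.
    // Second pass: keep tokens that look bold or are larger than the median.
    static List<Token> highlighted(List<Token> page) {
        List<Token> result = new ArrayList<>();
        if (page.isEmpty()) {
            return result;
        }
        float[] sizes = new float[page.size()];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = page.get(i).fontSize;
        }
        float medianSize = median(sizes);
        for (Token t : page) {
            if (looksBold(t.baseFontName) || t.fontSize > medianSize) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample tokens, for illustration only.
        List<Token> page = Arrays.asList(
                new Token("Report", "Helvetica", 18f),
                new Token("This", "Helvetica", 10f),
                new Token("important", "Helvetica-Bold", 10f),
                new Token("text", "Helvetica", 10f));
        for (Token t : highlighted(page)) {
            System.out.println(t.text);
        }
    }
}
```

In a real program you would fill the token list from the TextPositions that the PDFTextStripper hands you for one page, then run the second pass over the same list instead of parsing the page twice.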


I hope this helps.

With kind regards,
Robert



André Ramos wrote:

Hello,

I'd like to use PDFBox to extract text with special features like: bold
text, italicized text, text whose font size is above average and so on. The
idea is that any kind of highlighted text or any text formatted out of the
ordinary within a document must contain relevant terms to describe the
document.

How can I do it?

Thank you.
