iText can give you the font name, size, and location of all the text on
the page. Without PDF Structure, it is up to you to interpret that
information.
For a title, you might consider all the text in the largest font on the
first page to be the title. Such a heuristic will be brittle. You could
refine or relax it in various ways to work better with your particular
PDFs, but at the end of the day, it will always be possible to find (or
create) a PDF that breaks it.
A biography of Theodore Roosevelt might be entitled:
Speak softly and
Carry a Big Stick.
A reasonable heuristic, such as the largest-font rule above, could
determine this title to be "Carry a Big Stick" (if, say, that line is set
in larger type than the first). And it would be wrong.
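
To make the failure concrete, here is a minimal sketch of the largest-font heuristic described above, run against the Roosevelt title. It does not call iText: TextChunk is a hypothetical holder for the text, font size, and position you would collect from the parser, and the font sizes and y values are invented for illustration (the example assumes the second line is set larger than the first).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one run of extracted text; NOT an iText class.
class TextChunk {
    final String text;
    final float fontSize; // in points (assumed value, for illustration)
    final float y;        // baseline height on the page; larger is nearer the top

    TextChunk(String text, float fontSize, float y) {
        this.text = text;
        this.fontSize = fontSize;
        this.y = y;
    }
}

class TitleHeuristic {
    // Joins every chunk set in the largest font size, reading top to bottom.
    static String guessTitle(List<TextChunk> firstPage) {
        float largest = 0f;
        for (TextChunk c : firstPage) {
            largest = Math.max(largest, c.fontSize);
        }

        List<TextChunk> hits = new ArrayList<>();
        for (TextChunk c : firstPage) {
            if (c.fontSize == largest) {
                hits.add(c);
            }
        }
        // PDF y coordinates grow upward, so sort descending to read top-down.
        hits.sort((a, b) -> Float.compare(b.y, a.y));

        StringBuilder title = new StringBuilder();
        for (TextChunk c : hits) {
            if (title.length() > 0) {
                title.append(' ');
            }
            title.append(c.text);
        }
        return title.toString();
    }

    public static void main(String[] args) {
        List<TextChunk> page = new ArrayList<>();
        page.add(new TextChunk("Speak softly and", 24f, 700f));
        page.add(new TextChunk("Carry a Big Stick.", 36f, 660f));
        page.add(new TextChunk("A biography of Theodore Roosevelt", 12f, 620f));
        System.out.println(guessTitle(page)); // prints "Carry a Big Stick."
    }
}
```

You could patch this particular case by accepting the two largest sizes, but then a page with a large drop cap or a publisher logo would break it instead.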
--Mark Storer
Senior Software Engineer
Cardiff.com
import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;
________________________________
From: Balder [mailto:[email protected]]
Sent: Friday, June 24, 2011 8:31 AM
To: [email protected]
Subject: Re: [iText-questions] How to extract title / heading from document contents
This depends on the PDF. Is the PDF tagged? If so, you might be able to
find out what the title and headings are. If it's not tagged, good luck
guessing the title and headings from the text found in the document.
On 24/06/2011 14:10, modie wrote:
Hi,
Sorry, I am new to iTextSharp and cannot find documentation for it anywhere
other than this forum. I am looking to extract content from a PDF document,
but I need to be able to understand the structure / markup in the document.
I want to extract the heading / title for the document, which would generally
be found on the first page. Any ideas how I would do this? In HTML I would
look for the h1 or h2 tag.
PS - no, I don't want the title property of the document
--
View this message in context:
http://itext-general.2136553.n4.nabble.com/How-to-extract-title-heading-from-document-contents-tp3622357p3622357.html
Sent from the iText - General mailing list archive at Nabble.com.
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be
answered with a reference to the iText book:
http://www.itextpdf.com/book/
Please check the keywords list before you ask for
examples: http://itextpdf.com/themes/keywords.php
--
@redlabbe <http://twitter.com/redlabbe>
redlab-log <http://www.redlab.be/blog>