Re: [iText-questions] How to extract title / heading from document contents

Mark Storer Fri, 24 Jun 2011 10:37:44 -0700

iText can give you the font name, size, and location of all the text on
the page.  Without PDF structure (tagging), it is up to you to interpret
that information.
 
For a title, you might consider all the text in the largest font on the
first page to be the title.  Such a heuristic will be brittle.  You could
refine or relax it in various ways to work better with your particular
PDFs, but at the end of the day, it will always be possible to find (or
create) a PDF that will break it.
 
A biography of Theodore Roosevelt might be entitled:
 
Speak softly and 
Carry a Big Stick.
 
A reasonable heuristic could determine this title to be "Carry a Big
Stick".  And it would be wrong.
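
To make that concrete, here is a minimal sketch of the largest-font
heuristic in Java.  It assumes the text runs and their font sizes have
already been pulled out of the first page; TextRun and guessTitle are
hypothetical names used for illustration, not iText classes, and the
font sizes are made up (the second line is assumed to be set larger
than the first).

```java
import java.util.ArrayList;
import java.util.List;

/** Largest-font title heuristic over already-extracted text runs. */
final class TitleHeuristic {

    /** Hypothetical holder for one run of text; this is not an iText class. */
    static final class TextRun {
        final String text;
        final float fontSize;

        TextRun(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Joins, in reading order, every run set in the largest font size found. */
    static String guessTitle(List<TextRun> firstPageRuns) {
        // First pass: find the largest font size on the page.
        float largest = 0f;
        for (TextRun run : firstPageRuns) {
            largest = Math.max(largest, run.fontSize);
        }
        // Second pass: keep only the runs set in (nearly) that size.
        List<String> parts = new ArrayList<>();
        for (TextRun run : firstPageRuns) {
            // Small tolerance, since extracted sizes are floats.
            if (largest - run.fontSize < 0.01f) {
                parts.add(run.text.trim());
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // Assumed sizes: the second line of the title is set larger than the first.
        List<TextRun> cover = new ArrayList<>();
        cover.add(new TextRun("Speak softly and", 24f));
        cover.add(new TextRun("Carry a Big Stick.", 36f));
        System.out.println(guessTitle(cover)); // prints "Carry a Big Stick."
    }
}
```

Relaxing the rule (say, accepting the two largest sizes) would rescue
this particular cover, but would then pull in subtitles or author names
on other documents, which is the brittleness described above.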
 
--Mark Storer
  Senior Software Engineer
  Cardiff.com
 
import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;



________________________________

        From: Balder [mailto:[email protected]] 
        Sent: Friday, June 24, 2011 8:31 AM
        To: [email protected]
        Subject: Re: [iText-questions] How to extract title / heading from document contents
        
        
        This depends on the PDF. Is the PDF tagged? If so, you might be able
        to find out what the title and headings are. If it is not tagged,
        good luck guessing the title and headings from the text found in the
        document.
        
        On 24/06/2011 14:10, modie wrote: 

                Hi,
                
                Sorry, I am new to iTextSharp and cannot find documentation
                for it anywhere other than this forum. I am looking to extract
                content from a PDF document, but I need to be able to
                understand the structure / markup in the document.
                
                I want to extract the heading / title of the document, which
                would generally be found on the first page. Any ideas how I
                would do this? In HTML I would look for the h1 or h2 tag.
                
                PS - no, I don't want the title property of the document.
                 
                
                --
                View this message in context:
                http://itext-general.2136553.n4.nabble.com/How-to-extract-title-heading-from-document-contents-tp3622357p3622357.html
                Sent from the iText - General mailing list archive at Nabble.com.
                
        
                _______________________________________________
                iText-questions mailing list
                [email protected]
                https://lists.sourceforge.net/lists/listinfo/itext-questions
                
                iText(R) is a registered trademark of 1T3XT BVBA.
                Many questions posted to this list can (and will) be answered
                with a reference to the iText book: http://www.itextpdf.com/book/
                Please check the keywords list before you ask for examples:
                http://itextpdf.com/themes/keywords.php


        -- 
        
        @redlabbe <http://twitter.com/redlabbe> 
        redlab-log <http://www.redlab.be/blog>

