Hi,

(I am replying to the list, as this might be of interest to more people)

On Tuesday, 1 September 2009, Jose wrote:
> Hi, thank you for answering !
>
> On Mon, Aug 31, 2009 at 10:35 PM, Dominik Seichter <[email protected]> wrote:
> > Hi,
> >
> > Please take a look at tools/podofotxtextract.
> >
> > It shows how text extraction using PoDoFo could look. Please note that
> > this code is only a small demo and does not care much about most PDF
> > position commands and will support only the most basic encodings.
> >
> > I would love to extend this tool, though. So patches welcome!
>
> The example is already very useful. I am a newbie to the pdf spec, so
> I tried the tool with one pdf file that follows the normal text flow,
> e.g.
>
> - short line number 1
> - this is a very long line. Very
>   long line
>
> I get five elements all with coords (752,565), which are
>
> 1. First line hyphen
> 2. Text line
> 3. Second line hyphen
> 4. Second line
> 5. Line 3
>
> The flow makes sense. My questions are:
>
> 1) How do I find newlines if I don't know the format ? (it is easy to
> find the newlines when knowing the format but I want a general method)
There are normally no newlines in PDF text. You have to "guess" them by
parsing all the positioning commands in the PDF. Once you know where each text
element appears on the page, you can use heuristics to work out where a
newline should be.
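
To make the idea concrete, here is a minimal sketch of one such heuristic. It
assumes you have already collected the text fragments together with their
positions and font sizes. The TextFragment struct and the JoinWithLineBreaks
function are made up for this example and are not part of PoDoFo, and the
"half the font size" threshold is only a guess that you would have to tune.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one extracted text fragment; not a PoDoFo type.
struct TextFragment {
    double x;         // horizontal position in page units
    double y;         // baseline position in page units
    double fontSize;  // font size in page units
    std::string text; // decoded text of the fragment
};

// Joins fragments in content-stream order. A new line is started whenever
// the baseline moves by more than half the font size of the current
// fragment; otherwise the fragments are separated by a single space.
std::string JoinWithLineBreaks(const std::vector<TextFragment>& fragments)
{
    std::string result;
    for (std::size_t i = 0; i < fragments.size(); ++i) {
        if (i > 0) {
            const TextFragment& prev = fragments[i - 1];
            const TextFragment& cur = fragments[i];
            const double tolerance = 0.5 * cur.fontSize; // assumed threshold
            const bool newLine = std::fabs(cur.y - prev.y) > tolerance;
            result += newLine ? '\n' : ' ';
        }
        result += fragments[i].text;
    }
    return result;
}
```

This only looks at the vertical position. Real documents also need handling
for columns, rotated text and fragments that are not stored in reading order,
so treat it as a starting point.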

>
> 2) When all elements have the same coords, does it mean they are part
> of the page body and follow a normal flow?
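The podofotxtextract example understands only the two PDF position
commands "m" and "l" (moveto and lineto). PDF has lots of positioning
commands and matrix operations which can affect the position. You have to
parse all of these to get the correct positions. Identical coordinates
therefore only mean that the tool did not see a position change it
understands; they say nothing about the text flow.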
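
As a rough illustration of what parsing these commands involves, the sketch
below replays the text-positioning operators of the PDF specification (BT, TL,
Td, TD, Tm and T*) and keeps track of where the current text line starts. The
TextPositionTracker class and its method names are invented for this example
and are not part of PoDoFo. It deliberately ignores the "cm" operator (the
current transformation matrix) and the advance caused by showing glyphs, both
of which a complete extractor would have to handle as well.

```cpp
#include <array>

// Hypothetical helper, not part of PoDoFo: tracks the start of the current
// text line in text space while text-positioning operators are replayed.
class TextPositionTracker {
public:
    // BT: reset the text line matrix to the identity matrix.
    void BeginText() { m_ = {1.0, 0.0, 0.0, 1.0, 0.0, 0.0}; }

    // TL: set the leading that T* uses.
    void SetLeading(double leading) { leading_ = leading; }

    // Td: move by (tx, ty) relative to the start of the current line.
    void MoveText(double tx, double ty)
    {
        m_[4] += tx * m_[0] + ty * m_[2];
        m_[5] += tx * m_[1] + ty * m_[3];
    }

    // TD: same as Td, but also sets the leading to -ty.
    void MoveTextSetLeading(double tx, double ty)
    {
        leading_ = -ty;
        MoveText(tx, ty);
    }

    // Tm: replace the matrix with [a b c d e f].
    void SetMatrix(double a, double b, double c, double d, double e, double f)
    {
        m_ = {a, b, c, d, e, f};
    }

    // T*: move to the start of the next line (same as "0 -leading Td").
    void NextLine() { MoveText(0.0, -leading_); }

    double X() const { return m_[4]; }
    double Y() const { return m_[5]; }

private:
    std::array<double, 6> m_{{1.0, 0.0, 0.0, 1.0, 0.0, 0.0}};
    double leading_ = 0.0;
};
```

While walking the content stream you would call the matching method for each
operator and read X() and Y() whenever a text-showing operator such as Tj
appears. The result is in text space, so it still has to be combined with the
current transformation matrix to get page coordinates.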

best regards,
        Dom


>
> regards
>
>
> PS: I hope to contribute once I know a bit more!



-- 
**********************************************************************
Dominik Seichter - [email protected]
KRename  - http://www.krename.net  - Powerful batch renamer for KDE
KBarcode - http://www.kbarcode.net - Barcode and label printing
PoDoFo - http://podofo.sf.net - PDF generation and parsing library
SchafKopf - http://schafkopf.berlios.de - Schafkopf, a card game,  for KDE
Alan - http://alan.sf.net - A Turing Machine in Java
**********************************************************************


_______________________________________________
Podofo-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/podofo-users
