Generous Pohshna wrote:
> Hi Craig,
> 
> Thanks a lot for replying. That was an informative mail. At least I'll 
> know where to get started.

No problem. Wish I could be more helpful.

By the way, please use "reply to all" so that the list gets a copy of 
the mail. It might help someone else with a similar question who's 
searching for information later.

> Well, regarding your questions, like the program
> used to create the PDF: right now I don't have information on the program 
> that generated the PDF.

You must have one of the PDFs you need to work with, though, right?

You should be able to see what program created it if you have one of the 
PDFs. Just open it in Adobe Reader or Adobe Acrobat and view the 
document properties in the file menu. The Creator and the Producer 
fields tell you which software was used to make the PDF.
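
If you'd rather check from a script, here's a rough sketch in Python. It's 
a shortcut resting on some assumptions, not a proper PDF parser: it only 
scans the raw file bytes for /Creator and /Producer entries written as 
plain literal strings, so it will miss values stored as hex strings, 
inside compressed object streams, or in encrypted files. If the file has 
been updated incrementally it may also report an older value. The name 
info_fields and the sample bytes are made up for this example.

```python
import re

# Matches "/Creator (text)" or "/Producer (text)" where the value is a simple
# literal string with no nested or escaped parentheses (an assumption).
_FIELD = re.compile(rb"/(Creator|Producer)\s*\(([^()\\]*)\)")

def info_fields(pdf_bytes):
    """Return the Creator/Producer literal strings found in raw PDF bytes."""
    found = {}
    for name, value in _FIELD.findall(pdf_bytes):
        # Keep the first occurrence of each key; latin-1 decoding never fails.
        found.setdefault(name.decode("ascii"), value.decode("latin-1"))
    return found

# Hypothetical, hand-written fragment standing in for a real file's contents.
sample = (b"1 0 obj\n<< /Creator (Some Word Processor) "
          b"/Producer (Some PDF Library 1.0) >>\nendobj\n")
print(info_fields(sample))
```

For a real file you'd pass in open(path, "rb").read() instead of the 
sample bytes. If it returns nothing, fall back to the document properties 
dialog.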

> Also getting the position of the table data too would be difficult.

It's not that you need the co-ordinates of the data. Rather, the trouble 
is that the data in a PDF content stream just isn't structured in a way 
that makes it easy to extract particular pieces of information. It's 
more like PostScript or really badly written old-style HTML, in that the 
formatting is completely mixed up with the data being formatted. For the 
uses PDF is designed for, that's just fine, but it does make it hard to 
get data out when you need to.

If you want to see what I mean, use podofobrowser to examine a PDF 
content stream, or use podofouncompress to make a human-readable version 
of the PDF and view that in a text editor. The PDF is structured as a 
bunch of objects, each of which contains various data structures - 
usually dictionaries - and possibly a data stream.

PDF content streams appear as data streams. Here's an informative quote 
from the PDF Reference:

----
Example 5.1 illustrates the most straightforward use of a font. The text 
ABC is placed 10 inches from the bottom of the page and 4 inches from 
the left edge, using 12-point Helvetica.

BT
/F13 12 Tf
288 720 Td
(ABC) Tj
ET

The five lines of this example perform the following steps:
1. Begin a text object.
2. Set the font and font size to use, installing them as parameters in 
the text state.
(The font resource identified by the name F13 specifies the font 
externally known as Helvetica.)
3. Specify a starting position on the page, setting parameters in the 
text object.
4. Paint the glyphs for a string of characters at that position.
5. End the text object.
----

(That's from section 5.1.1 of the PDF Reference, available from 
http://www.adobe.com/devnet/pdf/pdf_reference.html, which you REALLY 
need to download and keep handy.)
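
To make the quoted example a bit more concrete, here's a minimal Python 
sketch that walks that five-line fragment and reports where each string 
is shown. It's a toy under stated assumptions, not a real content stream 
parser: it only understands BT, Td and Tj, it assumes literal strings 
contain no escapes or nested parentheses, and it ignores the text matrix, 
the graphics state and the glyph widths that advance the position after 
each Tj. The names TOKEN and show_text_ops are invented for this example. 
Positions are in default user space units of 1/72 inch, which is why 
288 720 means 4 inches across and 10 inches up.

```python
import re

STREAM = """BT
/F13 12 Tf
288 720 Td
(ABC) Tj
ET"""

# A token is a simple literal string, a name such as /F13, a number,
# or an operator keyword, tried in that order.
TOKEN = re.compile(r"\(([^()\\]*)\)|(/[^\s()/]+)|([-+]?\d*\.?\d+)|([A-Za-z*']+)")

def show_text_ops(stream):
    """Return (x, y, text) for each Tj, tracking only BT and Td."""
    operands = []
    x = y = 0.0
    results = []
    for match in TOKEN.finditer(stream):
        string, name, number, op = match.groups()
        if op is None:
            # Operands pile up until the next operator consumes them.
            if string is not None:
                operands.append(string)
            elif number is not None:
                operands.append(float(number))
            else:
                operands.append(name)
            continue
        if op == "BT":
            x = y = 0.0                  # a new text object starts at the origin
        elif op == "Td" and len(operands) >= 2:
            x += operands[-2]            # Td moves relative to the line start
            y += operands[-1]
        elif op == "Tj" and operands:
            results.append((x, y, operands[-1]))
        operands = []                    # every operator clears its operands
    return results

for x, y, text in show_text_ops(STREAM):
    print(text, "at", x / 72, "in from the left,", y / 72, "in from the bottom")
```

Real content streams use many more operators (TJ, Tm, T*, cm and so on), 
so treat this only as a way to see the operand-then-operator structure.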

As you can see, the string (ABC) is surrounded by a bunch of formatting 
operators. To extract it, you need to process the content stream. 
There's no guarantee that strings will appear in reading order, or as 
whole words or phrases. For example, instead of (PoDoFo) a PDF could 
contain (Po) ...blah... (DoF) ...blah... (o), say, if there was some 
per-character layout control being applied. In fact, especially in the 
presence of columns or other complex layout, there's sometimes little 
resemblance between the order of the content stream data and how it 
renders. (I don't know much about PDF content streams, but anyone who's 
had to suffer through trying to get text out of a PDF learns that much 
pretty quickly.)

In your case, even if your table looks like:

-----------------
   KEY  |  VALUE
-----------------
   k1   |   v1
   k2   |   v2

... the actual text elements could appear in the PDF in all sorts of 
orders, surrounded by various positioning and formatting operators. 
There isn't even a guarantee that they're part of the same content 
stream - the app could use XObjects or just multiple content streams per 
page.
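
As a rough illustration of what putting such a table back together 
involves, here's a Python sketch that takes text fragments in arbitrary 
order and rebuilds rows from their positions. Everything specific in it is 
an assumption for the sake of the example: the coordinates are made up, 
the function name rows_from_fragments is invented, and it presumes you've 
already worked out an (x, y, text) triple for every fragment, which is the 
hard part.

```python
from collections import defaultdict

def rows_from_fragments(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows, top to bottom, left to right."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Snap y into buckets so slightly misaligned baselines share a row.
        rows[round(y / y_tolerance)].append((x, text))
    # PDF y coordinates grow upward, so the largest y is the top row.
    return [[text for _, text in sorted(cells)]
            for _, cells in sorted(rows.items(), reverse=True)]

# Hypothetical fragments, deliberately not in reading order.
fragments = [
    (200, 680, "v2"), (100, 720, "KEY"), (200, 700, "v1"),
    (100, 680, "k2"), (200, 720, "VALUE"), (100, 700, "k1"),
]
for row in rows_from_fragments(fragments):
    print(" | ".join(row))
```

Bucketing by rounded y is crude: two baselines that sit either side of a 
bucket boundary end up in different rows, and fragments split 
mid-word would still need joining. It's meant to show the shape of the 
problem, not to be a robust solution.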

They might not even be text. Sometimes software will convert text to 
outlines - essentially to a mathematical description of the shape of the 
character. In that case there's no longer any text string in the PDF at all.

If you only have to handle data from one particular application you can 
probably figure out how it lays out the table and extract the data by 
processing the content stream. It'll take some work, though, and there's 
no guarantee it'll be reliable. If you have another way of obtaining the 
same data, consider looking into it.

--
Craig Ringer

_______________________________________________
Podofo-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/podofo-users