Generous Pohshna wrote:
> Hi everyone,
>
> I am newbie to PDF files and their structure.
>
> However i need to parse a pdf file and read all the contents of the file.
> Infact i need some text from the pdf and use which can be inside tables.

It depends on how the PDF was produced, but in general this is not easy.

The visible contents of a PDF (text, lines, etc.) are mostly contained within content streams. These are sequences of graphics operations that describe the appearance of the page or other object. It's not like (say) HTML, where you have marked-up structure like:

    <table><tr><td>item</td><td>value</td></tr></table>

Rather, a table in PDF would usually say something along the lines of:

    Draw the text `item' at (500,500)
    Draw the text `value' at (600,600)
    Draw a line from (490,480) to (490,520)
    ... etc.

To obtain the table data you might have to process the content stream and extract what you want based on location in the stream, on-page position, or other factors. The PdfContentsParser class in PoDoFo will help with this; have a look at test/ContentsParser/ for an example/test program.

Alternatively, you could use an existing package for extracting text from PDF and process the resulting text.

> I found this library PoDoFo.
>
> So how should i use this library to achive what i intended.

As noted above, the nature of the PDF format makes what you want potentially rather tricky.

Occasionally a PDF will also contain data that's meant for extraction and processing by other software. Generally that data will be something like an embedded copy of the document the PDF was made from, some application-specific data (Illustrator info, XML snippets from various apps, etc.) or additional metadata like JDF information. If your PDF was intended for machine processing, it could potentially contain the data you need outside a content stream.

I could tell whether I'm leading you down the wrong path if you provided some more information, such as the program used to create the PDF.

--
Craig Ringer
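
To make the point about content streams in the reply above concrete, here is a rough Python sketch of the "extract by on-page position" idea. It does not use PoDoFo. It only tokenizes a hand-written, uncompressed content-stream fragment and groups the strings drawn by the `Tj` operator into rows by their y coordinate. The sample stream, the function names and the `tolerance` parameter are all made up for illustration. Real PDFs usually have compressed streams, font-specific string encodings, `TJ` arrays and scaled or rotated text matrices, none of which this handles.

```python
import re
from collections import defaultdict

# Hypothetical, hand-written content-stream fragment for a 2x2 table.
# BT/ET delimit a text object, Tm sets the text matrix (last two numbers
# are the x/y translation), Tj shows a string, m/l/S draw a line.
SAMPLE_STREAM = """
BT /F1 12 Tf 1 0 0 1 500 500 Tm (item) Tj ET
BT /F1 12 Tf 1 0 0 1 600 500 Tm (value) Tj ET
BT /F1 12 Tf 1 0 0 1 500 480 Tm (apples) Tj ET
BT /F1 12 Tf 1 0 0 1 600 480 Tm (3) Tj ET
490 470 m 490 520 l S
"""

# Very simplified tokenizer: literal strings, names, numbers, operators.
TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)          # literal string, e.g. (item)
  | /[^\s/()\[\]<>]+              # name, e.g. /F1
  | [-+]?(?:\d+\.?\d*|\.\d+)      # number
  | [A-Za-z*]+                    # operator, e.g. Tj
""", re.VERBOSE)


def extract_text_positions(stream):
    """Return a list of (x, y, text) for each Tj in a simplified stream."""
    operands, found = [], []
    x = y = 0.0
    for tok in TOKEN.findall(stream):
        if tok.startswith("("):
            operands.append(tok[1:-1])      # string operand, escapes left as-is
        elif tok.startswith("/"):
            operands.append(tok)            # name operand
        elif tok[0] in "+-.0123456789":
            operands.append(float(tok))     # numeric operand
        else:
            # Operator: operands precede it (postfix notation).
            if tok == "BT":
                x = y = 0.0
            elif tok == "Tm" and len(operands) >= 6:
                x, y = operands[-2], operands[-1]
            elif tok == "Td" and len(operands) >= 2:
                x, y = x + operands[-2], y + operands[-1]
            elif tok == "Tj" and operands and isinstance(operands[-1], str):
                found.append((x, y, operands[-1]))
            operands.clear()                # other operators are ignored
    return found


def group_into_rows(items, tolerance=2.0):
    """Bucket strings by y (top row first), ordering each row by x."""
    rows = defaultdict(list)
    for x, y, text in items:
        rows[round(y / tolerance)].append((x, text))
    return [[text for _, text in sorted(cells)]
            for _, cells in sorted(rows.items(), reverse=True)]


if __name__ == "__main__":
    for row in group_into_rows(extract_text_positions(SAMPLE_STREAM)):
        print(row)
```

Nothing in the stream says "this is a table". The rows are inferred purely from where the strings are positioned, which is why the result depends so heavily on how the PDF was produced.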
_______________________________________________
Podofo-users mailing list
https://lists.sourceforge.net/lists/listinfo/podofo-users
