Looks exciting! Having something such as this as part of the base API seems to make a lot of sense.
I would think that it would be a significant component of the first stage
of a two-stage PDF rendering process. The first stage would use podofo to
parse a page's graphical description into memory (perhaps into a tree
organization as described by Craig). The second stage would descend
through the tree created by podofo and invoke a separate renderer to paint
the graphical objects (path, text, image) found within the tree.

I've downloaded and am now reading (hopefully end-to-end) the 1000+ page
PDF 1.7 specification. That's a beefy spec!

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Dominik Seichter
Sent: Wednesday, November 07, 2007 4:06 PM
To: [email protected]
Cc: Sargrad, Dave
Subject: Re: [Podofo-users] trying to build podofo using visual studio

Hi,

This discussion made me curious. So I tried to implement a simple content
stream parser on top of PdfTokenizer. I committed it, together with some
modifications to PdfTokenizer, to SVN under podofo/test/ContentParser/.
The example takes a PDF file as argument and parses the content stream of
the first page. Example output of the application looks like this:

<snip>
Keyword: BT
Variant: /Ft11
Variant: 16.000000
Keyword: Tf
Variant: 100.000000
Keyword: Tz
Variant: 0.000000
Keyword: Tc
Variant: 340.157000
Variant: 795.779000
Keyword: Td
Variant: <477261797363616C65202D20436F6C6F727370616365>
Keyword: Tj
Keyword: ET
Variant: 340.157000
Variant: 795.779000
Variant: 167.172000
Variant: 17.875000
Keyword: re
Keyword: S
Variant: 0.000000
Keyword: w
Variant: 0.000000
Keyword: G
</snap>

So as you see, it can decide whether it has read a keyword or a
PdfVariant. Every PdfVariant that is read is pushed onto a stack. If a
token like "m" for moveto is read, the two top values are taken from the
stack and used as parameters for the "moveto" function.
So a complete parser might look like this:

    #include <cstring>
    #include <iostream>
    #include <stack>
    #include <string>

    void parse_contents( PdfContentsTokenizer* pTokenizer )
    {
        const char*      pszToken = NULL;
        PdfVariant       var;
        EPdfContentsType eType;
        std::string      str;

        // operands wait here until the keyword that consumes them is read
        std::stack<PdfVariant> stack;

        std::cout << std::endl << "Parsing a page:" << std::endl;

        try {
            while( true )
            {
                pTokenizer->ReadNext( &eType, &pszToken, var );
                if( eType == ePdfContentsType_Keyword )
                {
                    std::cout << "Keyword: " << pszToken << std::endl;

                    // support 'l' and 'm' tokens
                    // (x is pushed before y, so y is popped first;
                    //  no check yet that two operands are really there)
                    if( strcmp( pszToken, "l" ) == 0 )
                    {
                        double dPosY = stack.top().GetReal();
                        stack.pop();
                        double dPosX = stack.top().GetReal();
                        stack.pop();

                        std::cout << "LineTo: " << dPosX << " " << dPosY << std::endl;
                    }
                    else if( strcmp( pszToken, "m" ) == 0 )
                    {
                        double dPosY = stack.top().GetReal();
                        stack.pop();
                        double dPosX = stack.top().GetReal();
                        stack.pop();

                        std::cout << "MoveTo: " << dPosX << " " << dPosY << std::endl;
                    }
                }
                else
                {
                    var.ToString( str );
                    std::cout << "Variant: " << str << std::endl;
                    stack.push( var );
                }
            }
        } catch( const PdfError & e ) {
            if( e.GetError() == ePdfError_UnexpectedEOF )
                return; // done with the stream
            else
                throw;  // rethrow the original exception object
        }
    }

Ok, you have to support more functions than "m" and "l" in the case of a
keyword. But I think it could work like this. Did I miss something or is
it really so easy? Apart from error checking and a good algorithm to find
a function and the number of parameters of course ....

If we do not use a simple stack but a tree, we should have the tree
structure Craig was talking about! So, Craig - when will you add this to
PoDoFoBrowser? :)

To be honest: if the PdfContentsTokenizer class seems useful, I would like
to add it to PoDoFo's main API.
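One possible shape for the "find a function and the number of parameters" step mentioned above is a lookup table from keyword to operand count. The following is only a minimal sketch under assumptions: it does not use PoDoFo at all, operands are modelled as plain doubles rather than PdfVariant, and the names Operation, OperandCounts and TakeOperands are hypothetical. The table covers just a handful of path operators.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record of one operation: the keyword plus the operands it
// consumed, in the order they appeared in the stream.
struct Operation {
    std::string keyword;
    std::vector<double> operands;
};

// Illustrative keyword -> operand count table for a few path operators.
// It is deliberately incomplete.
const std::map<std::string, std::size_t>& OperandCounts() {
    static const std::map<std::string, std::size_t> table = {
        { "m",  2 },  // moveto:    x y
        { "l",  2 },  // lineto:    x y
        { "c",  6 },  // curveto:   x1 y1 x2 y2 x3 y3
        { "re", 4 },  // rectangle: x y width height
        { "w",  1 },  // line width
        { "S",  0 },  // stroke the current path
    };
    return table;
}

// Moves the operands required by `keyword` from the top of `stack` into
// `out`. Returns false, leaving `stack` untouched, if the keyword is not in
// the table or the stack holds too few operands.
bool TakeOperands(const std::string& keyword,
                  std::vector<double>& stack,
                  Operation& out) {
    const std::map<std::string, std::size_t>& table = OperandCounts();
    const auto it = table.find(keyword);
    if (it == table.end() || stack.size() < it->second) {
        return false;
    }
    const std::size_t first = stack.size() - it->second;
    out.keyword = keyword;
    out.operands.assign(stack.begin() + first, stack.end());
    stack.erase(stack.begin() + first, stack.end());
    assert(out.operands.size() == it->second);
    return true;
}
```

With a table like this, the keyword branch of the parsing loop would shrink to a single TakeOperands call followed by a dispatch on the keyword, and the "too few operands" case becomes an ordinary error return instead of a pop on an empty stack.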
best regards,
	Dom

On Wednesday 07 November 2007, Craig Ringer wrote:
> Sargrad, Dave wrote:
> > Now that you've seen the pdf files that I'm currently interested in and
> > understand that I'm willing to put in the effort to "do this right",
> > and to contribute something back to the community, please help me to
> > understand the appropriate initial characteristics (API) of the
> > "content stream parser".
>
> OOh, one quick idea that wouldn't be needed for rendering data from a
> content stream but would be really handy for editing and analysing it
> would be to have the option of using the low level content stream
> parser's output to populate a tree representation.
>
> Content streams are stack based with postfix operators, right - RPN
> style, in essence, ie
>     4 3 + 2 -
> .
>
> It's a fairly simple problem to read a RPN-style sequence into a tree
> with interior nodes as operators and leaves as operands. Such a tree
> could be easily created from the output of a content stream parser
> that worked like the one I was muttering about. Once you have the
> tree, you can do all sorts of interesting things with it. It'd make
> editing content streams much easier, and when you're done and want to
> write it out you just traverse the tree depth-first .
>
> Hmm. Fun. Not much good for rendering, probably, even if it was
> possible to read the stream one top-level child node at a time (which
> it should be). For editing, though... I'm going to have to add this to
> PoDoFoBrowser.
>
> Anyway, definitely sleep time.
>
> --
> Craig Ringer
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems? Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >> http://get.splunk.com/
> _______________________________________________
> Podofo-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/podofo-users

--
**********************************************************************
Dominik Seichter - [EMAIL PROTECTED]
KRename   - http://www.krename.net        - Powerful batch renamer for KDE
KBarcode  - http://www.kbarcode.net       - Barcode and label printing
PoDoFo    - http://podofo.sf.net          - PDF generation and parsing library
SchafKopf - http://schafkopf.berlios.de   - Schafkopf, a card game, for KDE
Alan      - http://alan.sf.net            - A Turing Machine in Java
**********************************************************************
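
As an illustration of the tree idea Craig describes in the quoted message, here is a minimal sketch that reads his "4 3 + 2 -" example into a tree with operators as interior nodes and operands as leaves, then writes it back out with a depth-first (post-order) traversal. It is not PoDoFo code: Node, BuildTree and WriteRpn are hypothetical names, and the assumption that "+", "-", "*" and "/" are binary operators applies to this arithmetic example only, not to real content stream operators.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tree node: leaves are operands, interior nodes are operators.
struct Node {
    std::string token;
    std::vector<std::unique_ptr<Node>> children;  // empty for operands
};

// Builds a tree from a whitespace-separated RPN string. Assumes the four
// arithmetic operators below are binary; every other token is an operand.
// Returns nullptr if the input does not reduce to exactly one tree.
std::unique_ptr<Node> BuildTree(const std::string& rpn) {
    std::vector<std::unique_ptr<Node>> stack;
    std::istringstream in(rpn);
    std::string token;
    while (in >> token) {
        std::unique_ptr<Node> node(new Node);
        node->token = token;
        if (token == "+" || token == "-" || token == "*" || token == "/") {
            if (stack.size() < 2) {
                return nullptr;  // operator without enough operands
            }
            // The right operand was pushed last, so it is popped first.
            std::unique_ptr<Node> right = std::move(stack.back());
            stack.pop_back();
            std::unique_ptr<Node> left = std::move(stack.back());
            stack.pop_back();
            node->children.push_back(std::move(left));
            node->children.push_back(std::move(right));
        }
        stack.push_back(std::move(node));
    }
    if (stack.size() != 1) {
        return nullptr;  // empty input or leftover operands
    }
    return std::move(stack.back());
}

// Appends the subtree to `out` in postfix order: children first, then the
// node's own token, separated by single spaces.
void WriteRpn(const Node& node, std::string& out) {
    for (const std::unique_ptr<Node>& child : node.children) {
        assert(child != nullptr);
        WriteRpn(*child, out);
    }
    if (!out.empty()) {
        out += ' ';
    }
    out += node.token;
}
```

Building the tree from "4 3 + 2 -" puts "-" at the root with the "+" subtree and the leaf "2" as its children, and WriteRpn on that root reproduces the original token sequence. An editor could modify nodes between those two steps.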
