I was going to suggest exactly the same thing - using PdfTokenizer as
the basis for the content stream parser - since, yes, it's just that
simple. It's a slightly extended tokenizer that knows about commands/
operators (which you call keywords) and their operands
(parameters). That's the easy part ;).
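To make "knows about operators and their operands" a bit more concrete, here is a minimal sketch of an operator table that records how many operands each operator consumes. The names (OperandCounts, TakeOperands) are made up for illustration and are not PoDoFo API, operands are kept as plain strings instead of PdfVariant, and only a handful of operators are listed, with operand counts as given in the PDF Reference.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, not part of PoDoFo: maps a content stream operator
// to the number of operands it consumes.
inline const std::map<std::string, std::size_t>& OperandCounts()
{
    static const std::map<std::string, std::size_t> table = {
        { "BT", 0 }, { "ET", 0 }, { "q", 0 }, { "Q", 0 }, { "S", 0 },
        { "Tz", 1 }, { "Tc", 1 }, { "Tj", 1 }, { "w", 1 }, { "G", 1 },
        { "Tf", 2 }, { "Td", 2 }, { "m", 2 }, { "l", 2 },
        { "re", 4 }, { "cm", 6 }
    };
    return table;
}

// Moves the operands for `op` from the back of `pending` into `args`,
// keeping the order in which they appeared in the stream. Returns false
// (and leaves `pending` untouched) if the operator is unknown or too few
// operands are pending.
inline bool TakeOperands( const std::string& op,
                          std::vector<std::string>& pending,
                          std::vector<std::string>& args )
{
    args.clear();
    const auto it = OperandCounts().find( op );
    if( it == OperandCounts().end() || pending.size() < it->second )
        return false;

    const auto first = pending.end() - static_cast<std::ptrdiff_t>( it->second );
    args.assign( first, pending.end() );
    pending.erase( first, pending.end() );
    return true;
}
```

A parser loop would push every operand onto the pending vector and call TakeOperands whenever it reads an operator; a false return is the natural place to hook in error handling.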
Now you have to decide if you are going to do a "display list"
architecture or a "stream" architecture.
Poppler/Xpdf use the latter - which has the advantage of performance
for single renders and low memory usage. However, a display list
(DL) solution (à la Acrobat) offers better performance for multiple
renders (say, at various zoom levels) AND is necessary for
editing/modifying the content stream. It's also better for
doing multiple operations on a single stream (e.g. render to screen
AND extract text), BUT at the expense of memory and initial render
speed (though the latter can be solved by doing the initial rendering as
you build the DL).
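A rough sketch of the difference, under the assumption that the parser hands over already-decoded commands: in the stream approach each command goes straight to a consumer and is forgotten, while a display list stores the commands so they can be replayed. All names here (Command, Consumer, DisplayList, RenderWhileBuilding) are hypothetical, and a vector of commands stands in for the tokenizer's output.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One parsed content stream command: an operator plus its operands.
struct Command
{
    std::string         op;
    std::vector<double> operands;
};

// Anything that consumes commands: a renderer, a text extractor, ...
typedef std::function<void( const Command& )> Consumer;

// Display list: keeps every command so it can be replayed later.
class DisplayList
{
public:
    void Append( const Command& cmd ) { m_commands.push_back( cmd ); }
    std::size_t Size() const { return m_commands.size(); }

    // Feed the stored commands to a consumer; no re-parsing needed.
    void Replay( const Consumer& consumer ) const
    {
        for( const Command& cmd : m_commands )
            consumer( cmd );
    }

private:
    std::vector<Command> m_commands;
};

// Render each command as it arrives (the "stream" behaviour) and record it
// at the same time, so the first render does not wait for the full list.
inline void RenderWhileBuilding( const std::vector<Command>& parsed,
                                 const Consumer& renderer,
                                 DisplayList& list )
{
    for( const Command& cmd : parsed )
    {
        renderer( cmd );
        list.Append( cmd );
    }
}
```

After the first pass, a zoom change or a text extraction is just another Replay call with a different consumer. The cost is that every command stays in memory for as long as the list lives.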
You can either do your DL as a tree (as Craig suggested) or a simple
list/vector - I've done it both ways and each (surprise) has pros
and cons. A list/vector is easier to traverse and can use "out of the
box" STL data types + iterators, while a tree means either building a
new one (hopefully with iterator support!) or using something like
boost's graph library or the ASL's tree classes which do this
already. One advantage of the tree is that it automatically handles
"grouping" of container operators (BT/ET, q/Q, etc.) - while you'd
need to manage that "the hard way" using a list/vector.
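Here is one way the tree variant might look, as a sketch only: a flat list of operators is folded into a tree in which q and BT nodes own everything up to their closing Q or ET, and a depth-first walk writes the stream back out. The types and functions (Op, Node, BuildTree, Flatten) are invented for this example, operands are plain strings, and the sketch does not check that a closing operator matches the one that opened the group.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A flat entry as a parser would produce it: operator plus operands.
struct Op
{
    std::string              name;
    std::vector<std::string> operands;
};

// Tree node; children are only used for container operators (q, BT).
struct Node
{
    Op                op;
    std::vector<Node> children;
};

inline bool IsOpen( const std::string& n )  { return n == "q" || n == "BT"; }
inline bool IsClose( const std::string& n ) { return n == "Q" || n == "ET"; }

// Consumes ops from `pos` into `out` until a closing operator or the end.
inline void BuildChildren( const std::vector<Op>& ops, std::size_t& pos,
                           std::vector<Node>& out )
{
    while( pos < ops.size() )
    {
        const Op& op = ops[pos++];
        if( IsClose( op.name ) )
            return;                          // ends the current group
        out.push_back( Node{ op, {} } );
        if( IsOpen( op.name ) )
            BuildChildren( ops, pos, out.back().children );
    }
}

inline std::vector<Node> BuildTree( const std::vector<Op>& ops )
{
    std::vector<Node> roots;
    std::size_t pos = 0;
    while( pos < ops.size() )                // a stray closer is simply skipped
        BuildChildren( ops, pos, roots );
    return roots;
}

// Depth-first write-out: operands, operator, then (for groups) the
// children followed by the matching closing operator.
inline void Flatten( const std::vector<Node>& nodes,
                     std::vector<std::string>& out )
{
    for( const Node& node : nodes )
    {
        for( const std::string& operand : node.op.operands )
            out.push_back( operand );
        out.push_back( node.op.name );
        if( IsOpen( node.op.name ) )
        {
            Flatten( node.children, out );
            out.push_back( node.op.name == "q" ? "Q" : "ET" );
        }
    }
}
```

With this shape, removing a whole text object or graphics-state group is a single erase on the parent's children vector, which is the kind of edit that needs manual bracket matching on a flat list.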
Just some thoughts to give you more to contemplate...
Leonard
On Nov 7, 2007, at 10:05 PM, Dominik Seichter wrote:
> Hi,
>
> This discussion made me curious. So I tried to implement a simple content
> stream parser on top of PdfTokenizer. I committed it with some modifications
> to PdfTokenizer to SVN under podofo/test/ContentParser/.
>
> The example takes a PDF file as argument and parses the content stream of the
> first page.
>
> Example output of the application looks like this:
> <snip>
> Keyword: BT
> Variant: /Ft11
> Variant: 16.000000
> Keyword: Tf
> Variant: 100.000000
> Keyword: Tz
> Variant: 0.000000
> Keyword: Tc
> Variant: 340.157000
> Variant: 795.779000
> Keyword: Td
> Variant: <477261797363616C65202D20436F6C6F727370616365>
> Keyword: Tj
> Keyword: ET
> Variant: 340.157000
> Variant: 795.779000
> Variant: 167.172000
> Variant: 17.875000
> Keyword: re
> Keyword: S
> Variant: 0.000000
> Keyword: w
> Variant: 0.000000
> Keyword: G
> </snip>
>
> So as you see, it can tell whether it has read a keyword or a PdfVariant. Every
> PdfVariant read is pushed on a stack. If a token like "m" for moveto is read,
> the two top values are taken from the stack and used as parameters for
> the "moveto" function.
>
> So a complete parser might look like this:
>
> #include <cstring>   // strcmp
> #include <iostream>
> #include <stack>
> #include <string>
>
> void parse_contents( PdfContentsTokenizer* pTokenizer )
> {
>     const char*      pszToken = NULL;
>     PdfVariant       var;
>     EPdfContentsType eType;
>     std::string      str;
>
>     // operands wait here until the operator that consumes them is read
>     std::stack<PdfVariant> stack;
>     std::cout << std::endl << "Parsing a page:" << std::endl;
>
>     try
>     {
>         // the loop ends when ReadNext() throws at the end of the stream
>         while( true )
>         {
>             pTokenizer->ReadNext( &eType, &pszToken, var );
>             if( eType == ePdfContentsType_Keyword )
>             {
>                 std::cout << "Keyword: " << pszToken << std::endl;
>
>                 // support 'l' and 'm' tokens
>                 if( strcmp( pszToken, "l" ) == 0 )
>                 {
>                     // operands come off in reverse order: y first, then x
>                     double dPosY = stack.top().GetReal();
>                     stack.pop();
>                     double dPosX = stack.top().GetReal();
>                     stack.pop();
>
>                     std::cout << "LineTo: " << dPosX << " " << dPosY << std::endl;
>                 }
>                 else if( strcmp( pszToken, "m" ) == 0 )
>                 {
>                     double dPosY = stack.top().GetReal();
>                     stack.pop();
>                     double dPosX = stack.top().GetReal();
>                     stack.pop();
>
>                     std::cout << "MoveTo: " << dPosX << " " << dPosY << std::endl;
>                 }
>             }
>             else
>             {
>                 var.ToString( str );
>                 std::cout << "Variant: " << str << std::endl;
>                 stack.push( var );
>             }
>         }
>     }
>     catch( const PdfError & e )
>     {
>         if( e.GetError() == ePdfError_UnexpectedEOF )
>             return; // done with the stream
>         else
>             throw;  // rethrow the original exception without copying it
>     }
> }
>
> Ok, you have to support more functions than "m" and "l" in the case of a
> keyword. But I think it could work like this.
>
> Did I miss something or is it really so easy? Apart from error checking and a
> good algorithm to find a function and the number of parameters of
> course ....
>
> If we do not use a simple stack, but a tree, we should have the tree structure
> Craig was talking about! So, Craig - when will you add this to
> PoDoFoBrowser? :) To be honest: if the PdfContentsTokenizer class seems
> useful, I would like to add it to PoDoFo's main API.
>
> best regards,
> Dom
>
> On Wednesday 07 November 2007, Craig Ringer wrote:
>> Sargrad, Dave wrote:
>>> Now that you've seen the pdf files that I'm currently interested in and
>>> understand that I'm willing to put in the effort to "do this right", and
>>> to contribute something back to the community, please help me to
>>> understand the appropriate initial characteristics (API) of the "content
>>> stream parser".
>>
>> Ooh, one quick idea that wouldn't be needed for rendering data from a
>> content stream but would be really handy for editing and analysing it
>> would be to have the option of using the low level content stream
>> parser's output to populate a tree representation.
>>
>> Content streams are stack based with postfix operators, right - RPN
>> style, in essence, i.e.
>>     4 3 + 2 -
>>
>> It's a fairly simple problem to read an RPN-style sequence into a tree
>> with interior nodes as operators and leaves as operands. Such a tree
>> could be easily created from the output of a content stream parser that
>> worked like the one I was muttering about. Once you have the tree, you
>> can do all sorts of interesting things with it. It'd make editing
>> content streams much easier, and when you're done and want to write it
>> out you just traverse the tree depth-first (post-order).
>>
>> Hmm. Fun. Not much good for rendering, probably, even if it was possible
>> to read the stream one top-level child node at a time (which it should
>> be). For editing, though... I'm going to have to add this to
>> PoDoFoBrowser.
>>
>> Anyway, definitely sleep time.
>>
>> --
>> Craig Ringer
>>
>> ---------------------------------------------------------------------
>> ----
>> This SF.net email is sponsored by: Splunk Inc.
>> Still grepping through log files to find problems? Stop.
>> Now Search log events and configuration files using AJAX and a
>> browser.
>> Download your FREE copy of Splunk now >> http://get.splunk.com/
>> _______________________________________________
>> Podofo-users mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/podofo-users
>
>
>
> --
> **********************************************************************
> Dominik Seichter - [EMAIL PROTECTED]
> KRename - http://www.krename.net - Powerful batch renamer for KDE
> KBarcode - http://www.kbarcode.net - Barcode and label printing
> PoDoFo - http://podofo.sf.net - PDF generation and parsing library
> SchafKopf - http://schafkopf.berlios.de - Schafkopf, a card game,
> for KDE
> Alan - http://alan.sf.net - A Turing Machine in Java
> **********************************************************************