Looks exciting! Having something such as this as part of the base API
seems to make a lot of sense. 

I would think that it would be a significant component of the first
stage of a two-stage PDF rendering process. The first stage would use
PoDoFo to parse a page's graphical description into memory (perhaps into
a tree organization as described by Craig). The second stage would
descend through the tree created by PoDoFo and invoke a separate
renderer to paint the graphical objects (path, text, image) found within
the tree.
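
To make the two stages a bit more concrete, here is a rough sketch in
plain C++. None of these types exist in PoDoFo: GraphicsNode, Renderer,
LogRenderer, MakeNode and RenderTree are names I made up, and the
payload string is just a placeholder for real path/text/image data. In
practice stage one would fill the tree from a parsed content stream;
here it would have to be built by hand.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node kinds, matching the path/text/image split above.
enum class NodeKind { Group, Path, Text, Image };

// Hypothetical in-memory tree that stage one would produce.
struct GraphicsNode
{
    NodeKind kind;
    std::string payload; // placeholder for real path/text/image data
    std::vector<std::unique_ptr<GraphicsNode>> children;
};

// Hypothetical renderer interface; stage two only talks to this.
struct Renderer
{
    virtual ~Renderer() {}
    virtual void PaintPath( const std::string& data ) = 0;
    virtual void PaintText( const std::string& data ) = 0;
    virtual void PaintImage( const std::string& data ) = 0;
};

// Stage two: visit each node, then its children, and hand every
// graphical object to the renderer.
void RenderTree( const GraphicsNode& node, Renderer& renderer )
{
    switch( node.kind )
    {
        case NodeKind::Path:  renderer.PaintPath( node.payload );  break;
        case NodeKind::Text:  renderer.PaintText( node.payload );  break;
        case NodeKind::Image: renderer.PaintImage( node.payload ); break;
        case NodeKind::Group: break; // groups only hold children
    }
    for( const auto& child : node.children )
    {
        assert( child != nullptr );
        RenderTree( *child, renderer );
    }
}

// A stand-in renderer that only records the calls it receives.
struct LogRenderer : Renderer
{
    std::vector<std::string> log;
    void PaintPath( const std::string& d ) override  { log.push_back( "path:" + d ); }
    void PaintText( const std::string& d ) override  { log.push_back( "text:" + d ); }
    void PaintImage( const std::string& d ) override { log.push_back( "image:" + d ); }
};

// Small helper to allocate a leaf node.
std::unique_ptr<GraphicsNode> MakeNode( NodeKind kind, const std::string& payload )
{
    std::unique_ptr<GraphicsNode> node( new GraphicsNode );
    node->kind = kind;
    node->payload = payload;
    return node;
}
```

The point of the Renderer interface is that the traversal does not need
to know which drawing backend sits behind it.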

I've downloaded and am now reading (hopefully end-to-end) the 1000+ page
PDF 1.7 specification. That's a beefy spec!





-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Dominik
Seichter
Sent: Wednesday, November 07, 2007 4:06 PM
To: [email protected]
Cc: Sargrad, Dave
Subject: Re: [Podofo-users] trying to build podofo using visual studio


Hi,

This discussion made me curious, so I tried to implement a simple
content stream parser on top of PdfTokenizer. I committed it, along with
some modifications to PdfTokenizer, to SVN under
podofo/test/ContentParser/.

The example takes a PDF file as argument and parses the content stream
of the first page.

Example output of the application looks like this:
<snip>
Keyword: BT
Variant: /Ft11
Variant: 16.000000
Keyword: Tf
Variant: 100.000000
Keyword: Tz
Variant: 0.000000
Keyword: Tc
Variant: 340.157000
Variant: 795.779000
Keyword: Td
Variant: <477261797363616C65202D20436F6C6F727370616365>
Keyword: Tj
Keyword: ET
Variant: 340.157000
Variant: 795.779000
Variant: 167.172000
Variant: 17.875000
Keyword: re
Keyword: S
Variant: 0.000000
Keyword: w
Variant: 0.000000
Keyword: G
</snap>

So as you can see, it can tell whether it has read a keyword or a
PdfVariant. Every PdfVariant that is read is pushed onto a stack. If a
token like "m" for moveto is read, the top two values are taken from the
stack and used as parameters for the "moveto" function.

So a complete parser might look like this:

void parse_contents( PdfContentsTokenizer* pTokenizer ) 
{
    const char*      pszToken = NULL;
    PdfVariant       var;
    EPdfContentsType eType;
    std::string      str;

    std::stack<PdfVariant> stack;
    std::cout << std::endl << "Parsing a page:" << std::endl;

    try 
    {
        while( true )
        {
            pTokenizer->ReadNext( &eType, &pszToken, var );
            if( eType == ePdfContentsType_Keyword )
            {
                std::cout << "Keyword: " << pszToken << std::endl;

                // support 'l' and 'm' tokens
                if( strcmp( pszToken, "l" ) == 0 ) 
                {
                    double dPosY = stack.top().GetReal();
                    stack.pop();
                    double dPosX = stack.top().GetReal();
                    stack.pop();

                    std::cout << "LineTo: " << dPosX << " " << dPosY
                              << std::endl;
                }
                else if( strcmp( pszToken, "m" ) == 0 ) 
                {
                    double dPosY = stack.top().GetReal();
                    stack.pop();
                    double dPosX = stack.top().GetReal();
                    stack.pop();

                    std::cout << "MoveTo: " << dPosX << " " << dPosY
                              << std::endl;
                }

            }
            else
            {
                var.ToString( str );
                std::cout << "Variant: " << str << std::endl;
                stack.push( var );
            }
        }
    }
    catch( const PdfError & e )
    {
        if( e.GetError() == ePdfError_UnexpectedEOF )
            return; // done with the stream
        else
            throw; // rethrow the original exception without copying it
    }
}

OK, you have to support more functions than "m" and "l" in the keyword
case, but I think it could work like this.

Did I miss something, or is it really that easy? Apart from error
checking and a good algorithm to look up a function and its number of
parameters, of course...
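
For the lookup part, one option would be a table keyed by operator name
that stores the expected operand count together with a handler. The
following is only a sketch under a few assumptions: it uses plain
doubles instead of PdfVariant so that it compiles without PoDoFo, and
OperatorInfo, OperatorTable, Describe, BuildOperatorTable and Dispatch
are made-up names, not existing API. The operand counts for "m", "l"
and "re" (2, 2 and 4) are the ones from the PDF specification.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical table entry: expected operand count plus a handler.
struct OperatorInfo
{
    std::size_t operandCount;
    std::function<std::string( const std::vector<double>& )> handler;
};

typedef std::map<std::string, OperatorInfo> OperatorTable;

// Formats "Name: a b c" so that the handlers stay one-liners.
std::string Describe( const char* name, const std::vector<double>& args )
{
    std::ostringstream out;
    out << name << ":";
    for( double value : args )
        out << " " << value;
    return out.str();
}

// Registers a few path operators; a real table would list many more.
OperatorTable BuildOperatorTable()
{
    OperatorTable table;
    table["m"]  = OperatorInfo{ 2, []( const std::vector<double>& a ) { return Describe( "MoveTo", a ); } };
    table["l"]  = OperatorInfo{ 2, []( const std::vector<double>& a ) { return Describe( "LineTo", a ); } };
    table["re"] = OperatorInfo{ 4, []( const std::vector<double>& a ) { return Describe( "Rectangle", a ); } };
    return table;
}

// Takes the operands for one keyword off the stack and calls its handler.
// Returns false for an unknown keyword or when operands are missing.
bool Dispatch( const OperatorTable& table, const std::string& keyword,
               std::vector<double>& stack, std::string& result )
{
    OperatorTable::const_iterator it = table.find( keyword );
    if( it == table.end() )
        return false; // unknown operator

    const std::size_t count = it->second.operandCount;
    if( stack.size() < count )
        return false; // malformed stream: not enough operands

    // Operands were pushed left to right, so the last `count` entries
    // are already in argument order.
    typedef std::vector<double>::difference_type Diff;
    std::vector<double> args( stack.end() - static_cast<Diff>( count ), stack.end() );
    stack.erase( stack.end() - static_cast<Diff>( count ), stack.end() );
    assert( args.size() == count );

    result = it->second.handler( args );
    return true;
}
```

With a table like this, the keyword branch of the loop would shrink to a
single Dispatch() call, and the "not enough operands" check covers the
case where top() would otherwise be called on an empty stack.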

If we do not use a simple stack but a tree, we should have the tree
structure Craig was talking about! So, Craig, when will you add this to
PoDoFoBrowser? :) To be honest: if the PdfContentsTokenizer class seems
useful, I would like to add it to PoDoFo's main API.
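
As a rough illustration of the tree idea, the sketch below reads Craig's
RPN example "4 3 + 2 -" into a tree and writes it back out with a
depth-first walk that visits children before their parent. It is a toy
under several assumptions: it works on whitespace-separated tokens
rather than on PdfContentsTokenizer output, it treats only "+" and "-"
as (binary) operators, and Node, Arity, BuildTree and WriteRpn are
invented names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Interior nodes are operators, leaves are operands.
struct Node
{
    std::string token;
    std::vector<std::unique_ptr<Node>> children;
};

// Assumption for this toy: "+" and "-" take two operands, every other
// token is an operand.
static std::size_t Arity( const std::string& token )
{
    return ( token == "+" || token == "-" ) ? 2 : 0;
}

// Builds a tree from an RPN string; returns null on malformed input.
std::unique_ptr<Node> BuildTree( const std::string& rpn )
{
    std::vector<std::unique_ptr<Node>> stack;
    std::istringstream in( rpn );
    std::string token;
    while( in >> token )
    {
        std::unique_ptr<Node> node( new Node );
        node->token = token;

        const std::size_t arity = Arity( token );
        if( stack.size() < arity )
            return nullptr; // operator without enough operands

        // Move the top `arity` subtrees under the operator, keeping order.
        for( std::size_t i = stack.size() - arity; i < stack.size(); ++i )
            node->children.push_back( std::move( stack[i] ) );
        stack.resize( stack.size() - arity );
        stack.push_back( std::move( node ) );
    }
    if( stack.size() != 1 )
        return nullptr; // empty input or leftover operands
    return std::move( stack.back() );
}

// Depth-first, children before parent: this reproduces the RPN form.
void WriteRpn( const Node& node, std::string& out )
{
    for( const auto& child : node.children )
    {
        assert( child != nullptr );
        WriteRpn( *child, out );
    }
    if( !out.empty() )
        out += ' ';
    out += node.token;
}
```

For "4 3 + 2 -" the root is "-" with the children "+" and "2". In an
actual content stream I would expect the tree to be much shallower,
since most operators do not leave a result on the stack for the next
one to consume.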

best regards,
        Dom

On Wednesday 07 November 2007, Craig Ringer wrote:
> Sargrad, Dave wrote:
> > Now that you've seen the PDF files that I'm currently interested in and
> > understand that I'm willing to put in the effort to "do this right",
> > and to contribute something back to the community, please help me to
> > understand the appropriate initial characteristics (API) of the
> > "content stream parser".
>
> OOh, one quick idea that wouldn't be needed for rendering data from a 
> content stream but would be really handy for editing and analysing it 
> would be to have the option of using the low level content stream 
> parser's output to populate a tree representation.
>
> Content streams are stack based with postfix operators, right - RPN 
> style, in essence, ie
>       4 3 + 2 -
> .
>
> It's a fairly simple problem to read a RPN-style sequence into a tree 
> with interior nodes as operators and leaves as operands. Such a tree 
> could be easily created from the output of a content stream parser 
> that worked like the one I was muttering about. Once you have the 
> tree, you can do all sorts of interesting things with it. It'd make 
> editing content streams much easier, and when you're done and want to 
> write it out you just traverse the tree depth-first.
>
> Hmm. Fun. Not much good for rendering, probably, even if it was 
> possible to read the stream one top-level child node at a time (which 
> it should be). For editing, though... I'm going to have to add this to
> PoDoFoBrowser.
>
> Anyway, definitely sleep time.
>
> --
> Craig Ringer
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems?  Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >> http://get.splunk.com/
> _______________________________________________
> Podofo-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/podofo-users



-- 
**********************************************************************
Dominik Seichter - [EMAIL PROTECTED]
KRename  - http://www.krename.net  - Powerful batch renamer for KDE
KBarcode - http://www.kbarcode.net - Barcode and label printing
PoDoFo - http://podofo.sf.net - PDF generation and parsing library
SchafKopf - http://schafkopf.berlios.de - Schafkopf, a card game, for KDE
Alan - http://alan.sf.net - A Turing Machine in Java
**********************************************************************
