Hi,

This discussion made me curious, so I tried to implement a simple content 
stream parser on top of PdfTokenizer. I committed it, together with some 
modifications to PdfTokenizer, to SVN under podofo/test/ContentParser/.

The example takes a PDF file as argument and parses the content stream of the 
first page.

Example output of the application looks like this:
<snip>
Keyword: BT
Variant: /Ft11
Variant: 16.000000
Keyword: Tf
Variant: 100.000000
Keyword: Tz
Variant: 0.000000
Keyword: Tc
Variant: 340.157000
Variant: 795.779000
Keyword: Td
Variant: <477261797363616C65202D20436F6C6F727370616365>
Keyword: Tj
Keyword: ET
Variant: 340.157000
Variant: 795.779000
Variant: 167.172000
Variant: 17.875000
Keyword: re
Keyword: S
Variant: 0.000000
Keyword: w
Variant: 0.000000
Keyword: G
</snap>

As you can see, the tokenizer can tell whether it has read a keyword or a 
PdfVariant. Every PdfVariant that is read gets pushed onto a stack. When a 
keyword such as "m" (moveto) is read, the top two values are popped from the 
stack and used as the parameters of the "moveto" function.

So a complete parser might look like this:

void parse_contents( PdfContentsTokenizer* pTokenizer ) 
{
    const char*      pszToken = NULL;
    PdfVariant       var;
    EPdfContentsType eType;
    std::string      str;

    std::stack<PdfVariant> stack;
    std::cout << std::endl << "Parsing a page:" << std::endl;

    try 
    {
        while( true )
        {
            pTokenizer->ReadNext( &eType, &pszToken, var );
            if( eType == ePdfContentsType_Keyword )
            {
                std::cout << "Keyword: " << pszToken << std::endl;

                // support 'l' and 'm' tokens
                if( strcmp( pszToken, "l" ) == 0 ) 
                {
                    double dPosY = stack.top().GetReal();
                    stack.pop();
                    double dPosX = stack.top().GetReal();
                    stack.pop();

                    std::cout << "LineTo: " << dPosX << " " << dPosY << std::endl;
                }
                else if( strcmp( pszToken, "m" ) == 0 ) 
                {
                    double dPosY = stack.top().GetReal();
                    stack.pop();
                    double dPosX = stack.top().GetReal();
                    stack.pop();

                    std::cout << "MoveTo: " << dPosX << " " << dPosY << std::endl;
                }

            }
            else
            {
                var.ToString( str );
                std::cout << "Variant: " << str << std::endl;
                stack.push( var );
            }
        }
    }
    catch( const PdfError & e )
    {
        if( e.GetError() == ePdfError_UnexpectedEOF )
            return; // done with the stream
        else 
            throw; // rethrow the original exception instead of a copy
    }
}

OK, a real parser has to handle more keywords than "m" and "l", but I think 
it could work like this.

Did I miss something, or is it really that easy? Apart from error checking and 
a good way to look up the function and the number of parameters for each 
keyword, of course ...
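
The lookup of a keyword and its number of parameters could be done with a 
small table. The following is only a sketch and not part of PoDoFo: 
operand_counts() and pop_operands() are hypothetical names, and the table 
lists just a few fixed-arity operators with the operand counts given in the 
PDF reference. It is written as a template so that it would work with a 
std::stack<PdfVariant> as well as with plain doubles.

```cpp
#include <cstddef>
#include <stack>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical table: operator keyword -> number of operands it consumes.
// Only a few fixed-arity operators are listed; a real parser needs them all.
inline const std::unordered_map<std::string, std::size_t>& operand_counts()
{
    static const std::unordered_map<std::string, std::size_t> table = {
        { "m", 2 },  { "l", 2 },  { "c", 6 },  { "re", 4 },
        { "w", 1 },  { "G", 1 },  { "S", 0 },
        { "BT", 0 }, { "ET", 0 }, { "Tf", 2 }, { "Td", 2 },
        { "Tj", 1 }, { "Tz", 1 }, { "Tc", 1 }
    };
    return table;
}

// Pops the operands of 'keyword' from the stack and returns them in the
// order in which they appeared in the content stream.
// Throws std::runtime_error for unknown keywords or missing operands.
template <typename T>
std::vector<T> pop_operands( const std::string& keyword, std::stack<T>& operands )
{
    const auto it = operand_counts().find( keyword );
    if( it == operand_counts().end() )
        throw std::runtime_error( "unknown operator: " + keyword );

    const std::size_t count = it->second;
    if( operands.size() < count )
        throw std::runtime_error( "too few operands for: " + keyword );

    std::vector<T> args( count );
    for( std::size_t i = count; i > 0; --i )
    {
        // The last operand read is on top, so fill the vector back to front.
        args[i - 1] = operands.top();
        operands.pop();
    }
    return args;
}
```

With something like this, the "m" and "l" branches above would reduce to one 
call to pop_operands() followed by a dispatch on the keyword, and the 
empty-stack case would be caught instead of calling top() on an empty stack.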

If we use a tree instead of a simple stack, we should get the tree structure 
Craig was talking about! So, Craig, when will you add this to 
PoDoFoBrowser? :) To be honest: if the PdfContentsTokenizer class seems 
useful, I would like to add it to PoDoFo's main API.
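
To make the tree idea a bit more concrete, here is a rough sketch. It is 
independent of PoDoFo: Token, Node, build_tree() and write_tree() are made-up 
names, and operands are kept as plain strings instead of PdfVariant. It 
assumes that each operator node owns the operands read before it, and that 
nesting only comes from well-balanced q/Q and BT/ET pairs. Writing the stream 
back out is then a depth-first traversal. It needs a C++17 compiler because 
Node contains a std::vector<Node>.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical token type: either an operator keyword or an operand.
struct Token
{
    bool        isKeyword;
    std::string text;
};

// One operator with its operands; q and BT nodes also hold nested operators.
struct Node
{
    std::string              op;
    std::vector<std::string> operands;
    std::vector<Node>        children;
};

// Builds a tree from a flat token sequence. Assumes q/Q and BT/ET are balanced.
Node build_tree( const std::vector<Token>& tokens )
{
    Node root;
    root.op = "(root)";

    // Innermost open group is open.back(). Only its children vector grows,
    // so the pointers to the enclosing groups stay valid.
    std::vector<Node*>       open( 1, &root );
    std::vector<std::string> pending; // operands waiting for their operator

    for( const Token& tok : tokens )
    {
        if( !tok.isKeyword )
        {
            pending.push_back( tok.text );
            continue;
        }
        if( tok.text == "Q" || tok.text == "ET" )
        {
            if( open.size() > 1 )
                open.pop_back(); // close the innermost group
            pending.clear();
            continue;
        }

        Node node;
        node.op = tok.text;
        node.operands.swap( pending ); // leaves 'pending' empty
        open.back()->children.push_back( std::move( node ) );

        if( tok.text == "q" || tok.text == "BT" )
            open.push_back( &open.back()->children.back() );
    }
    return root;
}

// Writes the tree back as content stream text, depth-first.
void write_tree( const Node& node, std::string& out )
{
    for( const Node& child : node.children )
    {
        for( const std::string& operand : child.operands )
            out += operand + " ";
        out += child.op + "\n";

        if( child.op == "q" || child.op == "BT" )
        {
            write_tree( child, out );
            out += ( child.op == "q" ? "Q\n" : "ET\n" );
        }
    }
}
```

An editor could then remove or reorder whole Node subtrees (for example a 
complete BT ... ET text block) before calling write_tree() on the root.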

best regards,
        Dom

On Wednesday 07 November 2007, Craig Ringer wrote:
> Sargrad, Dave wrote:
> > Now that you've seen the PDF files that I'm currently interested in and
> > understand that I'm willing to put in the effort to "do this right", and
> > to contribute something back to the community, please help me to
> > understand the appropriate initial characteristics (API) of the "content
> > stream parser".
>
> OOh, one quick idea that wouldn't be needed for rendering data from a
> content stream but would be really handy for editing and analysing it
> would be to have the option of using the low level content stream
> parser's output to populate a tree representation.
>
> Content streams are stack based with postfix operators, right - RPN
> style, in essence, ie
>       4 3 + 2 -
> .
>
> It's a fairly simple problem to read a RPN-style sequence into a tree
> with interior nodes as operators and leaves as operands. Such a tree
> could be easily created from the output of a content stream parser that
> worked like the one I was muttering about. Once you have the tree, you
> can do all sorts of interesting things with it. It'd make editing
> content streams much easier, and when you're done and want to write it
> out you just traverse the tree depth-first.
>
> Hmm. Fun. Not much good for rendering, probably, even if it was possible
> to read the stream one top-level child node at a time (which it should
> be). For editing, though... I'm going to have to add this to PoDoFoBrowser.
>
> Anyway, definitely sleep time.
>
> --
> Craig Ringer
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems?  Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >> http://get.splunk.com/
> _______________________________________________
> Podofo-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/podofo-users



-- 
**********************************************************************
Dominik Seichter - [EMAIL PROTECTED]
KRename  - http://www.krename.net  - Powerful batch renamer for KDE
KBarcode - http://www.kbarcode.net - Barcode and label printing
PoDoFo - http://podofo.sf.net - PDF generation and parsing library
SchafKopf - http://schafkopf.berlios.de - Schafkopf, a card game,  for KDE
Alan - http://alan.sf.net - A Turing Machine in Java
**********************************************************************
