Hi,
This discussion made me curious, so I tried to implement a simple content stream parser on top of PdfTokenizer. I committed it to SVN under podofo/test/ContentParser/, together with some modifications to PdfTokenizer.
The example takes a PDF file as its argument and parses the content stream of the
first page.
Example output of the application looks like this:
<snip>
Keyword: BT
Variant: /Ft11
Variant: 16.000000
Keyword: Tf
Variant: 100.000000
Keyword: Tz
Variant: 0.000000
Keyword: Tc
Variant: 340.157000
Variant: 795.779000
Keyword: Td
Variant: <477261797363616C65202D20436F6C6F727370616365>
Keyword: Tj
Keyword: ET
Variant: 340.157000
Variant: 795.779000
Variant: 167.172000
Variant: 17.875000
Keyword: re
Keyword: S
Variant: 0.000000
Keyword: w
Variant: 0.000000
Keyword: G
</snip>
So as you see, it can tell whether it has read a keyword or a PdfVariant. Every
PdfVariant that is read is pushed onto a stack. When a keyword such as "m"
(moveto) is read, the top two values are popped off the stack and used as the
parameters of the "moveto" function.
So a complete parser might look like this:
#include <cstring>   // strcmp
#include <iostream>
#include <stack>
#include <string>
// plus the PoDoFo headers and PdfContentsTokenizer from podofo/test/ContentParser/

void parse_contents( PdfContentsTokenizer* pTokenizer )
{
    const char*            pszToken = NULL;
    PdfVariant             var;
    EPdfContentsType       eType;
    std::string            str;
    std::stack<PdfVariant> stack;

    std::cout << std::endl << "Parsing a page:" << std::endl;

    try
    {
        // The loop ends when ReadNext() throws ePdfError_UnexpectedEOF,
        // which is handled in the catch block below.
        while( true )
        {
            pTokenizer->ReadNext( &eType, &pszToken, var );
            if( eType == ePdfContentsType_Keyword )
            {
                std::cout << "Keyword: " << pszToken << std::endl;

                // support 'l' and 'm' tokens; the operands are popped in
                // reverse order, so y comes off the stack before x
                if( strcmp( pszToken, "l" ) == 0 )
                {
                    double dPosY = stack.top().GetReal();
                    stack.pop();
                    double dPosX = stack.top().GetReal();
                    stack.pop();

                    std::cout << "LineTo: " << dPosX << " " << dPosY << std::endl;
                }
                else if( strcmp( pszToken, "m" ) == 0 )
                {
                    double dPosY = stack.top().GetReal();
                    stack.pop();
                    double dPosX = stack.top().GetReal();
                    stack.pop();

                    std::cout << "MoveTo: " << dPosX << " " << dPosY << std::endl;
                }
            }
            else
            {
                // not a keyword: print the operand and keep it for the next operator
                var.ToString( str );
                std::cout << "Variant: " << str << std::endl;
                stack.push( var );
            }
        }
    }
    catch( const PdfError & e )
    {
        if( e.GetError() == ePdfError_UnexpectedEOF )
            return; // done with the stream
        else
            throw;  // rethrow the original exception without copying it
    }
}
OK, you have to support more operators than "m" and "l" when a keyword is
read, but I think it could work like this.
Did I miss something, or is it really that easy? Apart from error checking and a
good way to look up the function and the number of parameters for each
keyword, of course ....
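One possible way to look up the function and its parameter count is a table keyed by the keyword. The following is only a sketch under assumptions: OperatorInfo, OperatorHandler and dispatch_operator are hypothetical names that are not part of PoDoFo, and plain doubles stand in for PdfVariant so that the example compiles on its own.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical handler type: receives the operands in the order in which
// they appeared in the stream. Doubles stand in for PdfVariant here.
using OperatorHandler = std::function<void( const std::vector<double>& )>;

// Hypothetical table entry: how many operands a keyword consumes and
// which function to call with them.
struct OperatorInfo
{
    std::size_t     nOperands;
    OperatorHandler handler;
};

// Looks up the keyword, removes its operands from the top of the stack
// and calls the handler. Returns false if the keyword is not in the table.
bool dispatch_operator( const std::map<std::string, OperatorInfo>& table,
                        const std::string& keyword,
                        std::vector<double>& stack )
{
    auto it = table.find( keyword );
    if( it == table.end() )
        return false;

    const std::size_t n = it->second.nOperands;
    if( stack.size() < n )
        throw std::runtime_error( "too few operands for operator " + keyword );

    // Copy the last n values in stream order, then drop them from the stack.
    const auto first = stack.end() - static_cast<std::ptrdiff_t>( n );
    std::vector<double> args( first, stack.end() );
    stack.erase( first, stack.end() );

    it->second.handler( args );
    return true;
}
```

With an entry such as { "m", { 2, moveTo } } in the table, the parser loop would only have to call dispatch_operator() for every keyword instead of growing a chain of strcmp() calls. The operand check also covers the stack underflow that the simple version above ignores.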
If we use a tree instead of a simple stack, we should get the tree structure
Craig was talking about! So, Craig, when will you add this to
PoDoFoBrowser? :) To be honest: if the PdfContentsTokenizer class seems
useful, I would like to add it to PoDoFo's main API.
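To illustrate the tree idea, here is a minimal sketch that reads a postfix token sequence such as "4 3 + 2 -" into a tree with operators as interior nodes and operands as leaves, and writes it back out with a post-order traversal. It is only an assumption of how this could look: Node, build_tree and write_postfix are hypothetical names, tokens are plain strings instead of PdfVariant, and the arity table has to be supplied by the caller. Since PDF operators consume their operands without leaving a result on the stack, a real content stream would probably give a list of shallow trees rather than one deep tree.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tree node: an operator with its operands as children,
// or an operand with no children.
struct Node
{
    std::string                        token;
    std::vector<std::unique_ptr<Node>> children;
};

// Builds a tree from a postfix sequence. 'arity' maps each operator to
// the number of operands it takes; every other token is a leaf.
std::unique_ptr<Node> build_tree( const std::vector<std::string>& tokens,
                                  const std::map<std::string, std::size_t>& arity )
{
    std::vector<std::unique_ptr<Node>> stack;

    for( const std::string& tok : tokens )
    {
        std::unique_ptr<Node> node( new Node );
        node->token = tok;

        auto it = arity.find( tok );
        if( it != arity.end() )
        {
            const std::size_t n = it->second;
            if( stack.size() < n )
                throw std::runtime_error( "too few operands for " + tok );

            // The last n subtrees become the children, in stream order.
            for( std::size_t i = stack.size() - n; i < stack.size(); ++i )
                node->children.push_back( std::move( stack[i] ) );
            stack.resize( stack.size() - n );
        }
        stack.push_back( std::move( node ) );
    }

    if( stack.size() != 1 )
        throw std::runtime_error( "sequence does not reduce to a single tree" );
    return std::move( stack.back() );
}

// Writes the tree back as a postfix sequence: children first, then the
// node's own token.
void write_postfix( const Node& node, std::string& out )
{
    for( const auto& child : node.children )
        write_postfix( *child, out );
    if( !out.empty() )
        out += ' ';
    out += node.token;
}
```

For the sequence above and an arity of 2 for "+" and "-", the root is "-" with the children "+" and "2". Editing would then mean changing or moving nodes before calling write_postfix() again.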
best regards,
Dom
On Wednesday 07 November 2007, Craig Ringer wrote:
> Sargrad, Dave wrote:
> > Now that you've seen the PDF files that I'm currently interested in and
> > understand that I'm willing to put in the effort to "do this right", and
> > to contribute something back to the community, please help me to
> > understand the appropriate initial characteristics (API) of the "content
> > stream parser".
>
> OOh, one quick idea that wouldn't be needed for rendering data from a
> content stream but would be really handy for editing and analysing it
> would be to have the option of using the low level content stream
> parser's output to populate a tree representation.
>
> Content streams are stack based with postfix operators, right? RPN
> style, in essence, i.e.
>     4 3 + 2 -
>
> It's a fairly simple problem to read a RPN-style sequence into a tree
> with interior nodes as operators and leaves as operands. Such a tree
> could be easily created from the output of a content stream parser that
> worked like the one I was muttering about. Once you have the tree, you
> can do all sorts of interesting things with it. It'd make editing
> content streams much easier, and when you're done and want to write it
> out you just traverse the tree depth-first.
>
> Hmm. Fun. Not much good for rendering, probably, even if it was possible
> to read the stream one top-level child node at a time (which it should
> be). For editing, though... I'm going to have to add this to PoDoFoBrowser.
>
> Anyway, definitely sleep time.
>
> --
> Craig Ringer
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems? Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >> http://get.splunk.com/
> _______________________________________________
> Podofo-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/podofo-users
--
**********************************************************************
Dominik Seichter - [EMAIL PROTECTED]
KRename - http://www.krename.net - Powerful batch renamer for KDE
KBarcode - http://www.kbarcode.net - Barcode and label printing
PoDoFo - http://podofo.sf.net - PDF generation and parsing library
SchafKopf - http://schafkopf.berlios.de - Schafkopf, a card game, for KDE
Alan - http://alan.sf.net - A Turing Machine in Java
**********************************************************************
