Sargrad, Dave wrote:
> "It would be absolutely wonderful if as part of your work you ended up writing
> even a rudimentary content stream parser that was self contained enough to be
> included in PoDoFo."
>
> Great. I would love to contribute a content stream parser. I don't quite know
> what this means yet, but perhaps we can talk about the proper API (from your
> perspective).
>
> With your mentoring I may be able to contribute a component to the podofo
> project that you find useful.
>
> "Looking at the attached PDF, I think it's safe to say you can handle a very
> restricted subset of PDF and still be OK. I begin to see why you're doing it
> the way you are. A content stream parser for that shouldn't be too hard to
> write at all by the looks."
>
> This was my impression/hope as well. I want to start simple, and yet put
> myself on a road to increasingly understand/use PDF.
>
> Now that you've seen the PDF files that I'm currently interested in and
> understand that I'm willing to put in the effort to "do this right", and to
> contribute something back to the community, please help me to understand the
> appropriate initial characteristics (API) of the "content stream parser".
Honestly, in this case the best thing you can probably do is read parts of the
PDF Reference. In my distinctly non-expert view I'd recommend:

Overview - section 2.1 and the overview intro.

Skim reading sections 3.1-3.4, 3.6 & 3.8. Looking at some sample PDFs might be
helpful here. PoDoFoBrowser or podofouncompress might be handy.

I'd unsurprisingly recommend reading section 3.7 "Content Streams" in detail.

To my mind, a basic content stream parser should be able to read a content
stream (just a byte sequence as far as anything else is concerned) and, as a
first stage, produce a stream of tokens. That's close to trivial, since tokens
are separated by whitespace or by delimiter characters such as parentheses,
brackets and slashes; the main thing to watch is that string literals may
themselves contain whitespace. Those tokens could then be processed to identify
operators, convert int/float tokens to real numeric values, etc, giving you a
stream of content stream elements that code could actually do something useful
with.

From there... I'm not sure what the best way is. Getting that far should be
pretty trivial though. I'm itching to have a go at it myself now that I've
actually looked at it, but it's now 4am and sleep is no longer optional.

--
Craig Ringer
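
The two-stage idea described above (raw tokens first, then typed elements grouped by operator) might look something like the following. This is only a rough sketch in Python to show the shape of the thing, not PoDoFo code and not a proposed PoDoFo API: the names `tokenize` and `parse` are made up for illustration. It assumes the stream has already been decompressed, and it deliberately ignores inline images (BI ... ID ... EI), treats dictionaries only as raw `<<` / `>>` tokens, and leaves strings and names as undecoded bytes.

```python
import re

WHITESPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"
NUMBER = re.compile(rb"[+-]?(\d+\.?\d*|\.\d+)")


def tokenize(data):
    """Stage 1: split a decoded content stream into raw byte tokens."""
    i, n = 0, len(data)
    while i < n:
        c = data[i:i + 1]
        if c in WHITESPACE:
            i += 1
        elif c == b"%":
            # Comment: skip to end of line.
            while i < n and data[i:i + 1] not in b"\r\n":
                i += 1
        elif c == b"(":
            # Literal string: parentheses nest, backslash escapes one byte.
            depth, j = 1, i + 1
            while j < n and depth:
                ch = data[j:j + 1]
                if ch == b"\\":
                    j += 1
                elif ch == b"(":
                    depth += 1
                elif ch == b")":
                    depth -= 1
                j += 1
            yield data[i:j]
            i = j
        elif data[i:i + 2] in (b"<<", b">>"):
            yield data[i:i + 2]
            i += 2
        elif c == b"<":
            # Hex string: runs to the closing '>'.
            j = data.find(b">", i)
            j = n if j < 0 else j + 1
            yield data[i:j]
            i = j
        elif c in b"[]{}":
            yield c
            i += 1
        else:
            # Name, number or operator: runs to whitespace or a delimiter.
            j = i + 1
            while j < n and data[j:j + 1] not in WHITESPACE + DELIMITERS:
                j += 1
            yield data[i:j]
            i = j


def parse(data):
    """Stage 2: yield (operator, operands) pairs with numbers converted."""
    stack = [[]]  # stack[0] holds operands; deeper levels are open arrays
    for tok in tokenize(data):
        if tok == b"[":
            stack.append([])
        elif tok == b"]" and len(stack) > 1:
            arr = stack.pop()
            stack[-1].append(arr)
        elif NUMBER.fullmatch(tok):
            stack[-1].append(float(tok) if b"." in tok else int(tok))
        elif tok[:1] in (b"(", b"<", b">", b"/") or tok in (b"true", b"false", b"null"):
            stack[-1].append(tok)  # strings, names, keywords kept as raw bytes
        elif len(stack) == 1:
            # Anything else at top level is taken to be an operator.
            yield tok.decode("latin-1"), stack[0]
            stack = [[]]
        else:
            stack[-1].append(tok)


if __name__ == "__main__":
    sample = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
    for operator, operands in parse(sample):
        print(operator, operands)
```

Keeping the tokenizer separate from the operator grouping means the first stage can be tested against raw byte strings on its own, and the second stage can later be swapped for something that builds real PDF objects instead of plain Python values.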
