Re: reading a PDF
On Nov 29, 2008, at 6:42 PM, Torsten Curdt wrote:

> I am a little stuck here. I am trying to extract (and later modify) text and images in a PDF. First I turned to PDFKit, but that seems to be way too high-level for these things. Now I am trying with Quartz.

Hi Torsten,

Quartz is indeed the level at which you'll need to work with this.

> There is a CGPDFScannerScan but that doesn't look right either.

It's definitely part of the puzzle. Essentially, you'll need to create an operator table:

    CGPDFOperatorTableRef table = CGPDFOperatorTableCreate();

Then add the PDF operators you're interested in to the table, like this:

    // Close, fill, and stroke path using nonzero winding number rule
    CGPDFOperatorTableSetCallback(table, "b", operator_b);

…where operator_b is your callback function, and "b" is the name of the operator.

Obtain the content stream for the page, and create a scanner to scan it:

    CGPDFContentStreamRef contentStream = CGPDFContentStreamCreateWithPage(cgPage);
    CGPDFScannerRef scanner = CGPDFScannerCreate(contentStream, table, self);

You can pass a pointer to the object from which you scan (likely self) in the last argument, so that you can easily reach back into the higher-level environment.

Finally, start scanning:

    CGPDFScannerScan(scanner);

During this call, whenever the scanner finds an operator that you added to the operator table, it will call your callback function. One option is to simply hook back into the Objective-C level, passing the scanner to the appropriate method, and perform your parsing there:

    void operator_b(CGPDFScannerRef scanner, void *info)
    {
        [(MyScanningObject *)info operator_b_withScanner:scanner];
    }

    - (void)operator_b_withScanner:(CGPDFScannerRef)scanner
    {
        // Do whatever is appropriate
    }

António

Today you are You, that is truer than true. There is no one alive who is Youer than You.
Today I am Me, and I am freer than free. There is no one alive who is Me-er than Me.
I am the BEST I can possibly be.
--Dr. Seuss

___
Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com
This email sent to [EMAIL PROTECTED]
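
For readers who want to see the callback flow in isolation, here is a minimal sketch in portable C. It is a toy model, not Quartz: every Toy* and toy_* name is hypothetical and invented for illustration, and the "content stream" is assumed to be already split into operand/operator pairs. It only shows the idea from the recipe: register callbacks by operator name, then let a scan loop call back for the operators you registered, passing along an opaque info pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for the Quartz types; none of these names are Quartz API. */
typedef void (*ToyOperatorCallback)(const char *operand, void *info);

typedef struct {
    const char *name;             /* PDF operator name, e.g. "Tj" or "b" */
    ToyOperatorCallback callback; /* invoked when the operator is seen */
} ToyTableEntry;

#define TOY_TABLE_CAPACITY 16

typedef struct {
    ToyTableEntry entries[TOY_TABLE_CAPACITY];
    size_t count;
} ToyOperatorTable;

/* Register a callback for an operator name. This plays the role that
   CGPDFOperatorTableSetCallback plays in the recipe. */
void toy_table_set_callback(ToyOperatorTable *table, const char *name,
                            ToyOperatorCallback callback)
{
    assert(table->count < TOY_TABLE_CAPACITY);
    table->entries[table->count].name = name;
    table->entries[table->count].callback = callback;
    table->count++;
}

/* One already-tokenized instruction: an operand and the operator after it. */
typedef struct {
    const char *operand;
    const char *op;
} ToyInstruction;

/* Walk the instructions and call back for registered operators only.
   Returns how many callbacks fired. */
size_t toy_scan(const ToyInstruction *stream, size_t length,
                const ToyOperatorTable *table, void *info)
{
    size_t fired = 0;
    for (size_t i = 0; i < length; i++) {
        for (size_t j = 0; j < table->count; j++) {
            if (strcmp(stream[i].op, table->entries[j].name) == 0) {
                table->entries[j].callback(stream[i].operand, info);
                fired++;
                break;
            }
        }
    }
    return fired;
}

/* Example callback: info points at an int counter owned by the caller. */
void toy_count_text(const char *operand, void *info)
{
    (void)operand;
    (*(int *)info)++;
}
```

Registering toy_count_text for "Tj" and scanning a stream that also contains other operators fires the callback only for the "Tj" instructions. The info pointer is how the callback reaches back into the caller's state, which is the same reason the recipe passes self as the last argument to CGPDFScannerCreate.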
Re: reading a PDF
Antonio,

Thank you for laying out this recipe; you really know this stuff! (Not that there was any question, given your flagship application PDFClerk :-) Thank you for deepening my understanding of working with PDFs!

Torsten,

The code I mentioned to you, from "Creating and Examining PDF Documents", is available online, something that eluded me in my initial note in this thread. It's in the Programming with Quartz code, which David Gelphman has generously made available online :-)

http://tinyurl.com/5b2c9w

Namaste,
Joel
http://frameworker.wordpress.com
reading a PDF
Hey folks,

I am a little stuck here. I am trying to extract (and later modify) text and images in a PDF. First I turned to PDFKit, but that seems to be way too high-level for these things. Now I am trying with Quartz.

While I get to the CGPDFPageRef

    int pages = CGPDFDocumentGetNumberOfPages(doc);
    for (int p = 1; p <= pages; p++) {
        CGPDFPageRef page = CGPDFDocumentGetPage(doc, p);
    }

I just assume that the actual content is hidden inside the page's content stream(s). Currently I am going through the VoyeurNode example, but I still can't seem to find where to get hold of the actual content.

There is a CGPDFScannerScan but that doesn't look right either.

I've looked at

http://developer.apple.com/documentation/GraphicsImaging/Conceptual/PDFKitGuide/PDFKit_Prog_Intro/chapter_1_section_1.html
http://developer.apple.com/documentation/graphicsimaging/reference/CGPDFContentStream/Reference/reference.html#//apple_ref/doc/uid/TP40001407-CH1g-SW3

and in particular at

http://developer.apple.com/documentation/GraphicsImaging/Conceptual/drawingwithquartz2d/dq_pdf_scan/chapter_15_section_3.html

Any other pointers?

cheers
--
Torsten
Re: reading a PDF
> I just assume that the actual content is hidden inside the page's content stream(s).

Raw content, mostly, sometimes. But the draw commands are what put it all together. For instance, you might have a paragraph of text where there is one draw command per line, or you might have a paragraph of text where there is one draw command per character. For an image that fills the page, you might have one content stream and one draw command, or you might have multiple image slices with one content stream and one draw command for each slice.

IOW, what you want is not so simple.

--
Scott Ribe
[EMAIL PROTECTED]
http://www.killerbytes.com/
(303) 722-0567 voice
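
The per-line versus per-character point can be sketched with the PDF text-showing operator Tj, which draws one string operand. The portable C below is a deliberately simplified illustration, not a PDF parser and not Quartz API: the function name toy_collect_tj_text is hypothetical, and it assumes literal strings without escapes or nested parentheses. It ignores TJ arrays, hex strings, and font encodings, all of which real documents use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Collect the text shown by "(...) Tj" instructions in a content stream
   fragment. Returns the number of Tj draw commands found and writes the
   concatenated text to out (always NUL-terminated).
   Simplifications: no escapes, no nested parentheses, no TJ arrays,
   no hex strings, no font encoding lookup. */
size_t toy_collect_tj_text(const char *stream, char *out, size_t out_size)
{
    size_t commands = 0;
    size_t used = 0;
    const char *p = stream;

    assert(out_size > 0);
    out[0] = '\0';

    while ((p = strchr(p, '(')) != NULL) {
        const char *end = strchr(p + 1, ')');
        if (end == NULL)
            break; /* unterminated string: stop scanning */

        /* Skip whitespace between the operand and its operator. */
        const char *op = end + 1;
        while (*op == ' ' || *op == '\n' || *op == '\t' || *op == '\r')
            op++;

        /* Accept only "Tj" followed by whitespace or end of input. */
        if (op[0] == 'T' && op[1] == 'j' &&
            (op[2] == '\0' || op[2] == ' ' || op[2] == '\n' ||
             op[2] == '\t' || op[2] == '\r')) {
            size_t len = (size_t)(end - (p + 1));
            if (used + len < out_size) { /* append only if it fits */
                memcpy(out + used, p + 1, len);
                used += len;
                out[used] = '\0';
            }
            commands++;
        }
        p = end + 1;
    }
    return commands;
}
```

Given "BT /F1 12 Tf (Hello) Tj ET" and "BT /F1 12 Tf (H) Tj (e) Tj (l) Tj (l) Tj (o) Tj ET", the function collects the same text from both, but the number of draw commands differs. That difference is why a per-operator callback alone does not hand you words or paragraphs: reassembling them is left to the caller.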
Re: reading a PDF
>> I just assume that the actual content is hidden inside the page's content stream(s).
>
> Raw content, mostly, sometimes. But the draw commands are what put it all together. For instance, you might have a paragraph of text where there is one draw command per line, or you might have a paragraph of text where there is one draw command per character.

Getting to the individual draw commands for the text/characters would be a first step ...and maybe even enough for what I am after. Is this what CGPDFOperatorTableSetCallback() is for?

> For an image that fills the page, you might have one content stream and one draw command, or you might have multiple image slices with one content stream and one draw command for each slice.

Would a PDF writer really slice the images up?

> IOW, what you want is not so simple.

I see. Well, I probably don't really need the image extraction. Just getting the text draw commands might suffice.

cheers
--
Torsten
Re: reading a PDF
On Nov 29, 2008, at 3:16 PM, Torsten Curdt wrote:

>>> I just assume that the actual content is hidden inside the page's content stream(s).
>>
>> Raw content, mostly, sometimes. But the draw commands are what put it all together. For instance, you might have a paragraph of text where there is one draw command per line, or you might have a paragraph of text where there is one draw command per character.
>
> Getting to the individual draw commands for the text/characters would be a first step ...and maybe even enough for what I am after. Is this what CGPDFOperatorTableSetCallback() is for?
>
>> For an image that fills the page, you might have one content stream and one draw command, or you might have multiple image slices with one content stream and one draw command for each slice.
>
> Would a PDF writer really slice the images up?
>
>> IOW, what you want is not so simple.
>
> I see. Well, I probably don't really need the image extraction. Just getting the text draw commands might suffice.

At my day job, we use pdfbox (see www.pdfbox.org) in automated tests. It basically grabs raw textual data and spits out two-dimensional arrays of strings. While it's Java-based, it may shed some light on how text extraction can be done. I do not, however, know if their licensing model will fit your needs (i.e., whether basing your code on theirs is even allowed).

There are some links on their site (http://www.pdfbox.org/references.html) which show how someone wrote a Cocoa app and used the Java bridge to interface with pdfbox.

___
Ricky A. Sharp
mailto:[EMAIL PROTECTED]
Instant Interactive(tm)
http://www.instantinteractive.com
Re: reading a PDF
> At my day job, we use pdfbox (see www.pdfbox.org) in automated tests. It basically grabs raw textual data and spits out two-dimensional arrays of strings. While it's Java-based, it may shed some light on how text extraction can be done. I do not, however, know if their licensing model will fit your needs (i.e., whether basing your code on theirs is even allowed).

Yes, I already had a look at that. But I was hoping I wouldn't have to spend the time to translate it to Objective-C myself. It's BSD licensed, so license-wise that would be fine.

> There are some links on their site (http://www.pdfbox.org/references.html) which show how someone wrote a Cocoa app and used the Java bridge to interface with pdfbox.

Uh ...interesting. I was hoping for something native though. Thanks for the pointer!

cheers
--
Torsten
Re: reading a PDF
Hi Torsten,

I haven't looked extensively into the latest Apple documentation on this subject and wouldn't be surprised if it has been rolled in recently, but David Gelphman's Programming with Quartz does go into this in some detail. Chapter 14, Creating and Examining PDF Documents, contains a listing (Listing 14.11) of fairly extensive code showing how to count and categorize the images used on each page of a PDF document.

Also, the quartz-dev mailing list would be another good place to discuss your topic.

Best Wishes for Technical Success!

Joel