Re: reading a PDF

2008-11-30 Thread Antonio Nunes

On Nov 29, 2008, at 6:42 PM, Torsten Curdt wrote:


> I am a little stuck here. I am trying to extract (and later modify)
> text and images in a PDF. First I turned to PDFKit but that seems to
> be way too high level for these things. Now I am trying with Quartz.


Hi Torsten,

Quartz is indeed the level at which you'll need to work with this.


> There is a CGPDFScannerScan but that doesn't look right either.


It's definitely part of the puzzle. Essentially, you'll need to create
an operator table:

	CGPDFOperatorTableRef table = CGPDFOperatorTableCreate();

Then add the PDF operators you're interested in to the table like this:

	// Close, fill, and stroke path using nonzero winding number rule
	CGPDFOperatorTableSetCallback(table, "b", operator_b);

…where operator_b is your callback function, and "b" is the name of
the operator.


Obtain the content stream for the page, and create a scanner to scan it:

	CGPDFContentStreamRef contentStream = CGPDFContentStreamCreateWithPage(cgPage);
	CGPDFScannerRef scanner = CGPDFScannerCreate(contentStream, table, self);


The last argument is an info pointer that the scanner hands to every
callback. Pass a pointer to the object from which you scan (likely
self), so that the callbacks can easily reach back into the
higher-level environment.


Finally start scanning:

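	CGPDFScannerScan(scanner);

During this call, whenever the scanner finds an operator that you added
to the operator table, it calls your callback function. Once the scan
returns, release the scanner, the content stream and the operator table
(CGPDFScannerRelease, CGPDFContentStreamRelease,
CGPDFOperatorTableRelease). One option for the callback is to simply
hook back into the Objective-C level, passing the scanner to the
appropriate method, and perform your parsing there: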


void operator_b(CGPDFScannerRef scanner, void *info)
{
	// info is the pointer passed as the last argument to CGPDFScannerCreate
	[(MyScanningObject *)info operator_b_withScanner:scanner];
}

- (void)operator_b_withScanner:(CGPDFScannerRef)scanner
{
	// Do whatever is appropriate
}
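
To make the callback flow above more concrete, here is a toy,
framework-free model in plain C of what a scanner does with an operator
table: operands pile up on a stack, and when a registered operator
turns up, its callback is called with the info pointer and pops what it
needs. This is only a sketch and not the Quartz API. ToyScanner,
toy_set_callback, toy_pop, toy_scan, op_Tj and Collector are all
hypothetical names, and the tokeniser is an assumption that only copes
with whitespace-separated tokens (no spaces inside strings, no arrays
or dictionaries).

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Toy model only: none of these names belong to Quartz. */
#define TOY_MAX_OPS 8
#define TOY_MAX_OPERANDS 16

typedef struct ToyScanner ToyScanner;
typedef void (*ToyCallback)(ToyScanner *scanner, void *info);

struct ToyScanner {
    const char *names[TOY_MAX_OPS];       /* registered operator names */
    ToyCallback callbacks[TOY_MAX_OPS];   /* callback for each name */
    int op_count;
    char operands[TOY_MAX_OPERANDS][64];  /* operands seen since the last operator */
    int operand_count;
    void *info;                           /* handed back to every callback */
};

/* Register a callback for one operator name (like adding it to the table). */
void toy_set_callback(ToyScanner *s, const char *name, ToyCallback cb)
{
    assert(s != NULL && s->op_count < TOY_MAX_OPS);
    s->names[s->op_count] = name;
    s->callbacks[s->op_count] = cb;
    s->op_count++;
}

/* Pop the most recent operand into out; returns 0 when none is left. */
int toy_pop(ToyScanner *s, char *out, size_t out_size)
{
    if (s->operand_count == 0)
        return 0;
    s->operand_count--;
    snprintf(out, out_size, "%s", s->operands[s->operand_count]);
    return 1;
}

/* Strings, names and numbers are operands; any other token is an operator. */
static int toy_is_operand(const char *tok)
{
    return tok[0] == '(' || tok[0] == '/' || tok[0] == '-' || tok[0] == '.'
        || isdigit((unsigned char)tok[0]);
}

/* Walk the stream, stacking operands and dispatching registered operators. */
void toy_scan(ToyScanner *s, const char *stream)
{
    char copy[256];
    snprintf(copy, sizeof copy, "%s", stream);  /* strtok modifies its input */
    for (char *tok = strtok(copy, " \n"); tok != NULL; tok = strtok(NULL, " \n")) {
        if (toy_is_operand(tok)) {
            if (s->operand_count < TOY_MAX_OPERANDS) {
                snprintf(s->operands[s->operand_count], sizeof s->operands[0], "%s", tok);
                s->operand_count++;
            }
            continue;
        }
        for (int i = 0; i < s->op_count; i++) {
            if (strcmp(s->names[i], tok) == 0) {
                s->callbacks[i](s, s->info);
                break;
            }
        }
        s->operand_count = 0;  /* leftover operands die with their operator */
    }
}

/* What the callbacks collect; reached through the info pointer. */
typedef struct {
    char text[128];
    int calls;
} Collector;

/* Callback for the text-showing operator Tj: pop "(string)" and keep the text. */
void op_Tj(ToyScanner *s, void *info)
{
    Collector *c = info;
    char operand[64];
    if (!toy_pop(s, operand, sizeof operand))
        return;
    size_t len = strlen(operand);
    if (len >= 2 && operand[0] == '(' && operand[len - 1] == ')') {
        operand[len - 1] = '\0';  /* drop the closing parenthesis */
        strncat(c->text, operand + 1, sizeof c->text - strlen(c->text) - 1);
        c->calls++;
    }
}
```

With a zeroed ToyScanner whose info points at a zeroed Collector,
registering op_Tj under "Tj" and scanning
"BT /F1 12 Tf (Hello) Tj (World) Tj ET" calls op_Tj twice and leaves
"HelloWorld" in the collector. Operators that were never registered
(BT, Tf, ET) are skipped. In Quartz itself the popping is done with
functions such as CGPDFScannerPopString rather than anything like
toy_pop.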


António


Today you are You, that is truer than true.
There is no one alive who is Youer than You.
Today I am Me, and I am freer than free.
There is no one alive who is Me-er than Me.
I am the BEST I can possibly be.

--Dr. Seuss



___

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com

This email sent to [EMAIL PROTECTED]


Re: reading a PDF

2008-11-30 Thread Joel Norvell
Antonio,

Thank you for laying out this recipe; you really know this stuff! (Not that 
there was any question, given your flagship application PDFClerk :-)

Thank you for deepening my understanding of working with PDFs!


Torsten,

The code I mentioned to you, from "Creating and Examining PDF Documents", is
available online, something that eluded me in my initial note in this thread.

It's in the "Programming with Quartz" code, which David Gelphman has generously
made available online :-)

http://tinyurl.com/5b2c9w


Namaste,
Joel

http://frameworker.wordpress.com




  


reading a PDF

2008-11-29 Thread Torsten Curdt
Hey folks,

I am a little stuck here. I am trying to extract (and later modify)
text and images in a PDF. First I turned to PDFKit but that seems to
be way too high level for these things. Now I am trying with Quartz.

While I get to the CGPDFPageRef

	int pages = CGPDFDocumentGetNumberOfPages(doc);
	for (int p = 1; p <= pages; p++) {   // page numbers start at 1
		CGPDFPageRef page = CGPDFDocumentGetPage(doc, p);
		// ... this is where I am stuck
	}

I just assume that the actual content is hidden inside the page's
content stream(s).

Currently I am going through the VoyeurNode example but I still
can't seem to find where to get hold of the actual content.
There is a CGPDFScannerScan but that doesn't look right either.

I've looked at

http://developer.apple.com/documentation/GraphicsImaging/Conceptual/PDFKitGuide/PDFKit_Prog_Intro/chapter_1_section_1.html
http://developer.apple.com/documentation/graphicsimaging/reference/CGPDFContentStream/Reference/reference.html#//apple_ref/doc/uid/TP40001407-CH1g-SW3

and in particular at

http://developer.apple.com/documentation/GraphicsImaging/Conceptual/drawingwithquartz2d/dq_pdf_scan/chapter_15_section_3.html

Any other pointers?

cheers
--
Torsten


Re: reading a PDF

2008-11-29 Thread Scott Ribe
> I just assume that the actual content is hidden inside the page's
> content stream(s).

Raw content, mostly, sometimes. But the draw commands are what put it all
together.

For instance, you might have a paragraph of text where there is one draw
command per line, or you might have a paragraph of text where there is one draw
command per character. For an image that fills the page, you might have one
content stream and one draw command, or you might have multiple image slices
with one content stream and one draw command for each slice.
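
One practical consequence for extraction code: it should not assume
that one draw command equals one line. A minimal sketch in plain C,
assuming a hypothetical TextRun accumulator and text_run_append helper
(neither is part of Quartz), simply appends whatever fragment each draw
command delivers, so per-line and per-character PDFs end up with the
same text. Working out where lines and words actually break would need
the text positioning operators as well, which this sketch ignores.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical accumulator; in a real scan it would be reached through
   the info pointer that each text-drawing callback receives. */
typedef struct {
    char text[256];
    int fragments;   /* how many draw commands contributed */
} TextRun;

/* Append one fragment, silently truncating if the buffer is full. */
void text_run_append(TextRun *run, const char *fragment)
{
    assert(run != NULL && fragment != NULL);
    size_t used = strlen(run->text);
    size_t room = sizeof run->text - used - 1;  /* keep space for the NUL */
    strncat(run->text, fragment, room);
    run->fragments++;
}
```

Feeding "Hello" as one fragment, or as the five fragments "H", "e",
"l", "l", "o", leaves the same text in the run; only the fragment count
differs.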

IOW, what you want is not so simple.

-- 
Scott Ribe
[EMAIL PROTECTED]
http://www.killerbytes.com/
(303) 722-0567 voice




Re: reading a PDF

2008-11-29 Thread Torsten Curdt
>> I just assume that the actual content is hidden inside the page's
>> content stream(s).
>
> Raw content, mostly, sometimes. But the draw commands are what put it all
> together.
>
> For instance, you might have a paragraph of text where there is one draw
> command per line, or you might have a paragraph of text where there is one draw
> command per character.

Getting to the individual draw commands for the text/characters would
be a first step ...and maybe even enough for what I am after. Is this
what the CGPDFOperatorTableSetCallback() is for?

> For an image that fills the page, you might have one
> content stream and one draw command, or you might have multiple image slices
> with one content stream and one draw command for each slice.

Would a PDF writer really slice the images up?

> IOW, what you want is not so simple.

I see.

Well, I probably don't really need the image extraction.
Just getting the text draw commands might suffice.

cheers
--
Torsten


Re: reading a PDF

2008-11-29 Thread Ricky Sharp


On Nov 29, 2008, at 3:16 PM, Torsten Curdt wrote:


>>> I just assume that the actual content is hidden inside the page's
>>> content stream(s).
>>
>> Raw content, mostly, sometimes. But the draw commands are what put
>> it all together.
>>
>> For instance, you might have a paragraph of text where there is one
>> draw command per line, or you might have a paragraph of text where
>> there is one draw command per character.
>
> Getting to the individual draw commands for the text/characters would
> be a first step ...and maybe even enough for what I am after. Is this
> what the CGPDFOperatorTableSetCallback() is for?
>
>> For an image that fills the page, you might have one
>> content stream and one draw command, or you might have multiple
>> image slices with one content stream and one draw command for each
>> slice.
>
> Would a PDF writer really slice the images up?
>
>> IOW, what you want is not so simple.
>
> I see.
>
> Well, I probably don't really need the image extraction
> Just getting the text draw commands might suffice.



At my day job, we use pdfbox (see www.pdfbox.org) in automated tests.
It basically grabs raw textual data and spits out two-dimensional
arrays of strings.

While it's Java-based, it may shed light on how text extraction can
be done.  I do not, however, know whether their licensing model will fit
your needs (i.e. whether you are even allowed to base your code on theirs).

There are some links on their site (http://www.pdfbox.org/references.html)
which show how someone wrote a Cocoa app and used the Java bridge to
interface with pdfbox.


___
Ricky A. Sharp mailto:[EMAIL PROTECTED]
Instant Interactive(tm)   http://www.instantinteractive.com





Re: reading a PDF

2008-11-29 Thread Torsten Curdt
> At my day job, we use pdfbox (see www.pdfbox.org) in automated tests.  It
> basically grabs raw textual data and spits out two-dimensional arrays of
> strings.
>
> While it's Java-based, it may shed light on how text extraction can be
> done.  I do not, however, know if their licensing model will fit your needs
> (i.e. if you base your code on theirs, is that even allowed).

Yes, I had already looked at that. But I was hoping I wouldn't have to
spend the time translating it to Objective-C myself.

It's BSD licensed - so license-wise that would be fine.

> There are some links on their site (http://www.pdfbox.org/references.html)
> which show how someone wrote a Cocoa app and used the Java bridge to
> interface with pdfbox.

Uh ...interesting. Was hoping for something native though.

Thanks for the pointer!

cheers
--
Torsten


Re: reading a PDF

2008-11-29 Thread Joel Norvell
Hi Torsten,

I haven't looked extensively into the latest Apple documentation on this
subject and wouldn't be surprised if it has been rolled in recently, but
David Gelphman's "Programming with Quartz" does go into this in some detail.

Chapter 14, "Creating and Examining PDF Documents", contains a listing (Listing
14.11) of fairly extensive code showing how to count and categorize the images
used on each page of a PDF document.

Also, the quartz-dev mailing list would be another good place to discuss your 
topic.

Best Wishes for Technical Success!

Joel




  