AJ - you seem to be mixing human terminology and PDF terminology.   A quick 
read of the relevant sections of the PDF standard will probably help.



Here is something I wrote for my upcoming book on PDF that might be helpful to 
you:

As described in the previous chapter, a PDF file is composed of one or more 
pages (of a fixed size), and the visible elements on each page come from either 
the page content or a series of annotations that sit (visually) on top of the 
content.  This chapter discusses the page content.



Page content is described using a special text-based syntax (related to, but 
different from, the PDF file syntax that you learned about in an earlier 
chapter), which is stored in the PDF inside a special type of stream object 
called a "Content Stream".  The content syntax is derived from Adobe's 
PostScript language and consists of a series of operators and their 
operands, where each operand can be expressed as a standard PDF object.



Given the above, your PDF page has a Content Stream whose operators tell the 
PDF reader to draw an image and also to draw some text.  Think of it as 
something like the following (which doesn't represent reality, but should 
help you):

    SaveState
        SetImageLocation
        DrawImage
    RestoreState
    SaveState
        SetTextLocation
        SetTextFontAndSize
        DrawText
    RestoreState
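
For comparison, the informal names above correspond roughly to real 
content-stream operators: q and Q save and restore the graphics state, cm and 
Do position and paint an image XObject, and Tf, Td and Tj set the font, move the 
text position and show a string inside a BT ... ET text object.  The Python 
sketch below (not iText code) tokenizes a made-up sample stream and prints each 
operator next to its informal name.  The resource names /Im0 and /F1, the 
numbers, the labels BeginText/EndText and the simplified tokenizer are all 
assumptions for illustration.

```python
import re

# Simplified content-stream tokenizer. Assumption: no comments, no nested
# parentheses inside strings, no inline images (BI ... ID ... EI).
TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)     # literal string
  | <<|>>                    # dictionary delimiters
  | <[0-9A-Fa-f\s]*>         # hex string
  | /[^\s()<>\[\]{}/%]*      # name such as /F1
  | [\[\]]                   # array delimiters
  | [^\s()<>\[\]{}/%]+       # number or operator keyword
""", re.VERBOSE)

# Real operators mapped to the informal names used in the listing above.
# BeginText/EndText are labels added here; they are not in the listing.
PSEUDO = {
    "q": "SaveState",
    "Q": "RestoreState",
    "cm": "SetImageLocation",    # really: concatenate a matrix to the CTM
    "Do": "DrawImage",           # really: paint a named XObject
    "BT": "BeginText",
    "Tf": "SetTextFontAndSize",
    "Td": "SetTextLocation",
    "Tj": "DrawText",
    "ET": "EndText",
}

def is_operator(tok):
    # Operators are bare keywords; operands are numbers, names, strings, etc.
    if tok in ("true", "false", "null"):
        return False
    return tok[0].isalpha() or tok in ("'", '"')

def outline(stream):
    """Return (operator, operands, informal name) for each operator in order."""
    result, operands = [], []
    for tok in TOKEN.findall(stream):
        if is_operator(tok):
            result.append((tok, operands, PSEUDO.get(tok, "(other)")))
            operands = []  # operands come BEFORE their operator (postfix)
        else:
            operands.append(tok)
    return result

# Hypothetical page content: one full-page image, then one line of text.
SAMPLE = """
q
  612 0 0 792 0 0 cm
  /Im0 Do
Q
q
  BT
    /F1 12 Tf
    72 720 Td
    (Added with a touch-up tool) Tj
  ET
Q
"""

for op, operands, name in outline(SAMPLE):
    print(f"{name:<20}{' '.join(operands + [op])}")
```

Note that the operands precede their operator, which is the PostScript 
heritage mentioned earlier.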



If you removed that entire Content Stream (the value of the /Contents key in 
the Page dictionary), you'd lose BOTH the image and the text.  Since you only 
want to lose the text, you would need to scan/parse/analyze the Content Stream, 
find the "text parts" and remove them.
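
A minimal sketch of that scan-and-remove step, in Python rather than iText 
code: it tokenizes an already decoded content stream and drops everything from 
BT (begin text object) to ET (end text object), which is where the text-showing 
operators live.  The function name strip_text_objects, the sample stream and 
the simplified tokenizer are assumptions; a real implementation would also have 
to cope with comments, nested parentheses, inline images, and graphics-state 
changes made inside a text object.

```python
import re

# Simplified content-stream tokenizer. Assumption: no comments, no nested
# parentheses inside strings, no inline images (BI ... ID ... EI).
TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)     # literal string, kept as ONE token
  | <<|>>                    # dictionary delimiters
  | <[0-9A-Fa-f\s]*>         # hex string
  | /[^\s()<>\[\]{}/%]*      # name such as /F1
  | [\[\]]                   # array delimiters
  | [^\s()<>\[\]{}/%]+       # number or operator keyword
""", re.VERBOSE)

def strip_text_objects(stream):
    """Return the stream with every BT ... ET text object removed."""
    kept = []
    in_text = False
    for tok in TOKEN.findall(stream):
        if tok == "BT":
            in_text = True        # start skipping
        elif tok == "ET":
            in_text = False       # stop skipping; ET itself is dropped too
        elif not in_text:
            kept.append(tok)
    return " ".join(kept)

# Hypothetical page content: one full-page image, then one line of text.
SAMPLE = """
q
  612 0 0 792 0 0 cm
  /Im0 Do
Q
q
  BT
    /F1 12 Tf
    72 720 Td
    (Added with a touch-up tool) Tj
  ET
Q
"""

print(strip_text_objects(SAMPLE))
```

Because a literal string is read as a single token, a string that happens to 
contain "BT" or "ET" does not confuse the scan.  On the sample, the image part 
(q ... cm /Im0 Do Q) survives and only the text object is dropped.  Reading the 
decoded stream from the page and writing the result back is left to the PDF 
library.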



Does that make more sense??



Leonard



-----Original Message-----
From: AJ Weber [mailto:awe...@comcast.net]
Sent: Wednesday, February 15, 2012 10:01 AM
To: itext-questions@lists.sourceforge.net
Subject: Re: [iText-questions] Strip Annotations?



It should not give me a blank page.  The page is actually a scanned image -- or 
somehow the entire mediabox is filled with an image of the document (I say 
somehow, because the Producer info says iText 2.1.4, but there's no actual 
Contents in the original document's page).



The "/Contents" of the page is actually what a user entered using the Acrobat 
Std/Pro "Touch-Up Text Tool".  That's all.  The actual text-content of the 
document isn't text at all; like I said, it's a single image that fills the 
entire mediabox.  Basically, they use that tool instead of a more appropriate 
Annotation mechanism such as a "stamp" or "text box".



Thus the /Contents and the actual document's content are entirely different.



Since the actual content of the document is an image, we are sending it to an 
OCR step.  If there is something in the /Contents, the OCR engine assumes there 
is no need to OCR and the result is virtually the same output PDF.  I need to 
remove that /Contents object so the OCR engine detects that it needs to OCR the 
underlying image rather than relying on the existing text.



So I would expect that if I CAN remove the /Contents object from a page, and 
there is still an image filling the mediabox for that page, we would still have 
that displayed (and OCR'ed correctly).





On 2/15/2012 9:32 AM, Leonard Rosenthol wrote:

> You can't remove the entire stream - that would give you a blank page!
>
> As Bruno said, you need to parse/analyze the page content and determine what 
> is "good" and what is "bad".
>
> Leonard
>
> -----Original Message-----
> From: AJ Weber [mailto:awe...@comcast.net]
> Sent: Wednesday, February 15, 2012 9:27 AM
> To: itext-questions@lists.sourceforge.net
> Subject: Re: [iText-questions] Strip Annotations?
>
> On 2/14/2012 11:02 AM, Leonard Rosenthol wrote:
>> Sure, it's possible that they are using some tool that adds text directly to 
>> the content instead of as an annotation.  Perfectly valid.
>>
>> In which case, removal is MUCH harder (but not impossible)
> OK...if I need to remove a page's /Contents object (and thus stream), can 
> anyone point to a quick method to do that?  Do I need to use one of the 
> "lower level" methods, or which class/method would be recommended?
>
> Thanks again,
> AJ
>
> ------------------------------------------------------------------------------
> Virtualization & Cloud Management Using Capacity Planning
> Cloud computing makes use of virtualization - but cloud computing also 
> focuses on allowing computing to be delivered as a service.
> http://www.accelacomm.com/jaw/sfnl/114/51521223/
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> iText(R) is a registered trademark of 1T3XT BVBA.
> Many questions posted to this list can (and will) be answered with a
> reference to the iText book: http://www.itextpdf.com/book/
> Please check the keywords list before you ask for examples:
> http://itextpdf.com/themes/keywords.php




