AJ - you seem to be mixing human terminology and PDF terminology. A quick
read of the relevant sections of the PDF standard will probably help.
Here is something I wrote for my upcoming book on PDF that might be helpful to
you:
As described in the previous chapter, a PDF file is composed of one or more
pages (of a fixed size), and the visible elements on each page come from either
the page content or a series of annotations that sit on top (visibly) of the
content. This chapter discusses the page content.
Page content is described using a special text-based syntax (related to, but
different from, the PDF file syntax that you learned about in an earlier
chapter), which is stored in the PDF inside a special type of stream object
called a "Content Stream". The content syntax is derived from Adobe's
PostScript language and consists of a series of operators and their operands,
where each operand can be expressed as a standard PDF object.
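As a rough illustration of the operator/operand layout, here is a minimal
Python sketch. In a content stream the operands come first and the operator
that consumes them comes last. The sample stream, the TOKEN pattern and the
parse_operations helper are made up for illustration and are not part of iText
or any PDF library. The tokenizer is deliberately simplified: it assumes an
already-decoded stream with no nested parentheses, dictionaries, comments or
inline images.

```python
import re

# Simplified token pattern (an assumption, not a full PDF lexer).
TOKEN = re.compile(
    rb"\((?:\\.|[^\\()])*\)"        # literal string, e.g. (Hello)
    rb"|<[0-9A-Fa-f\s]*>"           # hex string, e.g. <48656C6C6F>
    rb"|/[^\s/\[\]()<>]*"           # name, e.g. /F1
    rb"|[+-]?(?:\d+\.?\d*|\.\d+)"   # number
    rb"|[\[\]]"                     # array delimiters
    rb"|[^\s/\[\]()<>]+"            # anything else: an operator keyword
)

# First bytes that can only start an operand in this simplified grammar.
OPERAND_START = b"(</[]+-.0123456789"


def parse_operations(content: bytes):
    """Return a list of (operator, operands) pairs in stream order."""
    operations = []
    operands = []
    for match in TOKEN.finditer(content):
        token = match.group()
        if token[:1] in OPERAND_START:
            # Operands accumulate until the operator that consumes them.
            operands.append(token)
        else:
            operations.append((token.decode("latin-1"), operands))
            operands = []
    return operations


if __name__ == "__main__":
    # Hypothetical page content: place and draw an image, then show text.
    sample = (b"q 200 0 0 100 50 600 cm /Im0 Do Q "
              b"BT /F1 12 Tf 72 700 Td (Hello) Tj ET")
    for operator, args in parse_operations(sample):
        print(operator, args)
```

Each printed line pairs one operator (q, cm, Do, Tf, Tj and so on) with the
operands that preceded it in the stream.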
Given the above, you have a PDF page that consists of a Content Stream which
has operators that tell the PDF reader to draw an image and also draw some
text. Think of it as something like the following (which doesn't represent
reality, but should help you):
SaveState
SetImageLocation
DrawImage
RestoreState
SaveState
SetTextLocation
SetTextFontAndSize
DrawText
RestoreState
If you removed that entire Content Stream (the value of the /Contents key in
the Page object), you'd lose BOTH the image and the text. Since you only want
to lose the text, you would need to scan/parse/analyze the Content Stream,
find the "text parts" and remove them.
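A minimal sketch of that scan-and-remove step, in Python and under heavy
assumptions: the stream bytes are already decoded, all of the unwanted text
sits inside BT ... ET text objects, and the stream contains no inline images,
dictionaries or nested parentheses. TOKEN and strip_text_objects are
hypothetical names, not part of iText or any PDF library, and a real page may
well need more careful analysis than this.

```python
import re

# Simplified token pattern (an assumption, not a full PDF lexer).
TOKEN = re.compile(
    rb"\((?:\\.|[^\\()])*\)"        # literal string, e.g. (Hello)
    rb"|<[0-9A-Fa-f\s]*>"           # hex string
    rb"|/[^\s/\[\]()<>]*"           # name, e.g. /Im0
    rb"|[+-]?(?:\d+\.?\d*|\.\d+)"   # number
    rb"|[\[\]]"                     # array delimiters
    rb"|[^\s/\[\]()<>]+"            # operator keyword
)


def strip_text_objects(content: bytes) -> bytes:
    """Drop every token from BT through the matching ET, keep the rest."""
    kept = []
    in_text = False
    for match in TOKEN.finditer(content):
        token = match.group()
        if token == b"BT":
            in_text = True       # start of a text object: begin skipping
        elif token == b"ET":
            in_text = False      # end of the text object: resume keeping
        elif not in_text:
            kept.append(token)
    # Tokens are rejoined with single spaces; original layout is not kept.
    return b" ".join(kept)


if __name__ == "__main__":
    # Hypothetical page content: a full-page image plus some added text.
    page = (b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"
            b"q BT /F1 12 Tf 72 700 Td (Added later) Tj ET Q")
    print(strip_text_objects(page))
```

The image-drawing operators survive while the text object is dropped, which is
the outcome described above. The leftover empty q/Q pair is harmless.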
Does that make more sense??
Leonard
-----Original Message-----
From: AJ Weber [mailto:awe...@comcast.net]
Sent: Wednesday, February 15, 2012 10:01 AM
To: itext-questions@lists.sourceforge.net
Subject: Re: [iText-questions] Strip Annotations?
It should not give me a blank page. The page is actually a scanned image -- or
somehow the entire mediabox is filled with an image of the document (I say
somehow, because the Producer info says iText 2.1.4, but there's no actual
Contents in the original document's page).
The "/Contents" of the page is actually what a user entered using the Acrobat
Std/Pro "Touch-Up Text Tool". That's all. The actual text-content of the
document isn't text at all; like I said, it's a single image that fills the
entire mediabox. Basically, they use that tool instead of a more appropriate
Annotation mechanism such as a "stamp" or "text box".
Thus the /Contents and the actual document's content are entirely different.
Since the actual content of the document is an image, we are sending it to an
OCR step. If there is something in the /Contents, the OCR engine assumes there
is no need to OCR and the result is virtually the same output PDF. I need to
remove that /Contents object so the OCR engine detects that it needs to OCR the
underlying image rather than rely upon the existing text.
So I would expect that if I CAN remove the /Contents object from a page, and
there is still an image filling the mediabox for that page, we would still have
that displayed (and OCR'ed correctly).
On 2/15/2012 9:32 AM, Leonard Rosenthol wrote:
> You can't remove the entire stream - that would give you a blank page!
>
> As Bruno said, you need to parse/analyze the page content and determine what
> is "good" and what is "bad".
>
> Leonard
>
> -----Original Message-----
> From: AJ Weber [mailto:awe...@comcast.net]
> Sent: Wednesday, February 15, 2012 9:27 AM
> To: itext-questions@lists.sourceforge.net
> Subject: Re: [iText-questions] Strip Annotations?
>
> On 2/14/2012 11:02 AM, Leonard Rosenthol wrote:
>> Sure, it's possible that they are using some tool that adds text directly to
>> the content instead of as an annotation. Perfectly valid.
>>
>> In which case, removal is MUCH harder (but not impossible)
> OK...if I need to remove a page's /Contents object (and thus stream), can
> anyone point to a quick method to do that? Do I need to use one of the
> "lower level" methods, or which class/method would be recommended?
>
> Thanks again,
> AJ
>
> ----------------------------------------------------------------------
> -------- Virtualization& Cloud Management Using Capacity Planning
> Cloud computing makes use of virtualization - but cloud computing also
> focuses on allowing computing to be delivered as a service.
> http://www.accelacomm.com/jaw/sfnl/114/51521223/
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> iText(R) is a registered trademark of 1T3XT BVBA.
> Many questions posted to this list can (and will) be answered with a
> reference to the iText book: http://www.itextpdf.com/book/ Please
> check the keywords list before you ask for examples:
> http://itextpdf.com/themes/keywords.php