Hi,
I have done something similar: I filtered a few text instructions out of the
content stream. It can be done as shown below.
Suggestions for improving it are more than welcome, particularly regarding
the fact that I have to append spaces to the strings.
// Output buffer sized as a multiple of the input stream length.
char * buffer = new char[ pageStreamLength * N ]; // This is bad
PdfOutputDevice pdfOutputDevice( buffer, pageStreamLength * N );
while( tokenizer.ReadNext( t, keyword, variant ) )
{
    bool skip = false;
    // Operands come before their operator, so stack them up
    // until the next keyword arrives
    if ( t != ePdfContentsType_Keyword )
    {
        stack.push( variant );
        parameters++;
        continue;
    }
    // Process the operator if we're interested in it
    string k( keyword );
    // Handle operators
    if( k == ... ){
        skip = true;
    }
    [..]
    //
    // Filter it out: discard the operands of a skipped operator
    if( skip )
    {
        while( ... ) stack.pop();
        continue;
    }
    // The stack yields operands in reverse order, so push_front
    // restores the original order
    list< PdfVariant > l; // TODO: copy party
    while( parameters > 0 )
    {
        l.push_front( stack.top() );
        stack.pop();
        parameters--;
    }
    // Write operands, each followed by a separating space
    for_each( l.begin(),
              l.end(),
              [ & ]( const PdfVariant & operand )
              {
                  operand.Write( &pdfOutputDevice, ePdfWriteMode_Clean );
                  pdfOutputDevice.Write( " ", 1 );
              } );
    // Write operator, followed by a separating space
    k += " ";
    pdfOutputDevice.Write( k.c_str(), k.size() );
}
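For illustration only, here is a minimal self-contained sketch of the same buffer-then-filter idea, applied to dropping everything between BT and ET. It does not use PoDoFo at all: StripTextObjects and IsOperator are hypothetical helpers, and the input is assumed to be already split by whitespace, which a real content stream is not (string operands such as "(Hello World)" can contain spaces, so a proper tokenizer like the one in the loop above is still needed). Keeping the operands in a vector preserves their order, so no reversal step is required, and the separators are written in one place.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: treats a token as an operator if it starts with a
// letter or is one of the quote operators (' and "). The keywords
// true/false/null are operands, not operators.
bool IsOperator( const std::string & token )
{
    if( token == "true" || token == "false" || token == "null" )
        return false;
    const unsigned char c = static_cast<unsigned char>( token[0] );
    return std::isalpha( c ) || c == '\'' || c == '"';
}

// Hypothetical helper: returns the content stream without text objects,
// i.e. without anything from BT to ET inclusive.
// Assumption: tokens are separated by whitespace and never contain any.
std::string StripTextObjects( const std::string & content )
{
    std::istringstream in( content );
    std::vector<std::string> operands; // operands waiting for their operator
    std::string out;
    std::string token;
    bool inText = false;

    while( in >> token )
    {
        // Operands precede their operator, so collect them first
        if( !IsOperator( token ) )
        {
            operands.push_back( token );
            continue;
        }

        // BT and ET themselves are part of what gets removed
        if( token == "BT" ) inText = true;
        const bool skip = inText;
        if( token == "ET" ) inText = false;

        if( !skip )
        {
            for( const std::string & operand : operands )
            {
                out += operand;
                out += ' ';
            }
            out += token;
            out += '\n';
        }
        operands.clear(); // kept or dropped, the operands are consumed
    }
    return out;
}
```

With this sketch, "q BT /F1 12 Tf (Hi) Tj ET /Im0 Do Q" comes out as the three lines "q", "/Im0 Do" and "Q". One caveat: colour and other graphics state operators that appear inside a text object stay in effect after ET, so dropping the whole BT..ET range can change how later drawing looks. Skipping only the text operators would avoid that.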
On 27/11/2019 08:06, Jacob Pedersen wrote:
Hi
I think this is the way to go, rather than copying images over.
Removing text could be done in either of two ways:
1. Replace chars in text draw commands with spaces. I guess it would be the
simplest and fastest approach if possible.
2. Remove text draw commands (basically anything between and including BT and
ET). Fonts can be left, since PDF will just be converted into TIFF, so it must
just be readable. Size does not matter since it will exist for a fraction of a
second in memory.
I have not found any examples of how to replace or edit the contents stream. I
can iterate through it using the tokenizer, but where to go from there?
Thanks for more pointers 😊
-----Original message-----
From: zyx <z...@gmx.us>
Sent: 26 November 2019 07:51
To: podofo-users@lists.sourceforge.net
Subject: Re: [Podofo-users] Example of text removal or image extraction
On Mon, 2019-11-25 at 18:04 +0000, Jacob Pedersen wrote:
Basically I just need to make an identical PDF without the text.
Hi,
it would be much harder, but maybe you can modify the podofotxtextract tool so
that, instead of only extracting the text (see TextExtractor::ExtractText
there), it removes the text from the corresponding content streams, preserving
everything except the text-related operators. You might also be able to remove
the font objects, because they can make up a significant part of the PDF file.
Removing such objects, together with all references to them in the document,
can be tricky. The same goes for editing the content streams.
Bye,
zyx
_______________________________________________
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users