Hi,

I have done something similar, filtering a few text instructions out of the
content stream. It can be done along these lines.

Suggestions for improving the following are more than welcome, particularly
regarding the fact that I have to append spaces to strings.


  char * buffer = new char[ pageStreamLength * N ];// This is bad
  PdfOutputDevice pdfOutputDevice( buffer, pageStreamLength * N );


  while( tokenizer.ReadNext( t, keyword, variant ) )
  {
    bool skip = false;

    // Operands precede their operator, so stack them up until a keyword arrives
    if( t != ePdfContentsType_Keyword )
    {
      stack.push( variant );
      parameters++;
      continue;
    }

    // Process the operator if we are interested in it
    string k( keyword );

    // Handle operators
    if( k == ... )
    {
      skip = true;
    }
    [..]

    // Filter it out: discard the operands collected for this operator
    if( skip )
    {
      while( ... ) stack.pop();
      parameters = 0; // start counting afresh for the next operator
      continue;
    }

    // Unstack the operands back into their original order
    list< PdfVariant > l; //TODO: copy party
    while( parameters > 0 )
    {
      l.push_front( stack.top() );
      stack.pop();
      parameters--;
    }

    // Write operands, each followed by a separating space
    for_each( l.begin(),
              l.end(),
              [ & ]( const PdfVariant & operand )
              {
                operand.Write( &pdfOutputDevice, ePdfWriteMode_Clean );
                pdfOutputDevice.Write( " ", 1 );
              } );

    // Write operator, again followed by a separating space
    k += " ";
    pdfOutputDevice.Write( k.c_str(), k.size() );
  }
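
For comparison, here is a minimal, library-free sketch of the same filtering
pattern. It is not PoDoFo code: the Token struct and the filterOperators
function are hypothetical names made up for illustration, and operands are
assumed to be already serialized to strings. It only shows that a std::vector
keeps operands in their original order (so no stack-to-list copy is needed)
and that a std::string grows on demand (so no buffer size has to be guessed).

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one tokenizer result: an operand or an operator.
struct Token {
    bool isKeyword;    // true for an operator such as "Tj" or "re"
    std::string text;  // serialized form of the operand or operator
};

// Copies the token stream into a string, dropping every operator listed in
// `dropped` together with the operands that precede it.
std::string filterOperators(const std::vector<Token>& tokens,
                            const std::set<std::string>& dropped) {
    std::string out;                    // grows on demand, no size guess
    std::vector<std::string> operands;  // operands of the pending operator

    for (const Token& token : tokens) {
        if (!token.isKeyword) {
            operands.push_back(token.text);  // kept in original order
            continue;
        }
        if (dropped.count(token.text) == 0) {
            // Operator is kept: write its operands, then the operator itself.
            for (const std::string& operand : operands) {
                out += operand;
                out += ' ';
            }
            out += token.text;
            out += '\n';
        }
        operands.clear();  // reset for the next operator in both cases
    }
    return out;
}
```

The separator is written in one place only, so the individual strings do not
need a space appended to them.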

On 27/11/2019 08:06, Jacob Pedersen wrote:
Hi

I think this is the way to go, rather than copying images over.

Removing text could be done in either of two ways:

1. Replace chars in text draw commands with spaces. I guess it would be the 
simplest and fastest approach if possible.
2. Remove text draw commands (basically anything between and including BT and 
ET). Fonts can be left in place, since the PDF will just be converted to TIFF 
and only needs to be readable. Size does not matter, since it will exist in 
memory for only a fraction of a second.
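
Option 2 can be sketched roughly as follows. This is a simplified, library-free
illustration, not PoDoFo code: stripTextObjects is a hypothetical name, and the
content stream is assumed to be available as a decoded string. It splits tokens
on whitespace, which goes wrong for literal strings that contain spaces and for
inline image data, so a real implementation would want a proper content-stream
tokenizer. It also drops any graphics-state operators that appear inside a text
object, which may affect drawing that follows.

```cpp
#include <sstream>
#include <string>

// Returns `content` without the operators and operands between BT and ET
// (inclusive). Tokens are split on whitespace, which is a simplification.
std::string stripTextObjects(const std::string& content) {
    std::istringstream in(content);
    std::string out;
    std::string token;
    bool inText = false;  // text objects cannot nest, so a flag is enough

    while (in >> token) {
        if (token == "BT") { inText = true;  continue; }
        if (token == "ET") { inText = false; continue; }
        if (inText) continue;  // drop everything inside the text object

        if (!out.empty()) out += ' ';
        out += token;
    }
    return out;
}
```

For example, given "q 0 0 10 10 re f BT /F1 12 Tf (Hi) Tj ET Q", the sketch
keeps only the graphics operators around the text object.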

I have not found any examples of how to replace or edit the contents stream. I 
can iterate through it using the tokenizer, but where to go from there?

Thanks for more pointers 😊

-----Original message-----
From: zyx <z...@gmx.us>
Sent: 26 November 2019 07:51
To: podofo-users@lists.sourceforge.net
Subject: Re: [Podofo-users] Example of text removal or image extraction

On Mon, 2019-11-25 at 18:04 +0000, Jacob Pedersen wrote:
Basically I just need to make an identical PDF without the text.

        Hi,
it would be much harder, but maybe you can change the podofotxtextract tool so 
that, instead of only extracting the text (see TextExtractor::ExtractText 
there), it removes the text from the corresponding content streams, preserving 
everything but the text-related things. You might also be able to remove any 
font objects, because they can make up a significant part of the PDF file. 
Removing such objects, along with all references to them in the document, can 
be tricky. The same goes for editing the content streams.
        Bye,
        zyx



_______________________________________________
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users
