Re: Text removal

a7med shre3y Tue, 24 Mar 2015 01:58:03 -0700

You can download it from here:
https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing


Best Regards,


On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <[email protected]>
wrote:

>
>
> > On 24.03.2015 at 09:40, a7med shre3y <[email protected]> wrote:
> >
> > Hi,
> >
> > In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG" to
> > "To Be Approved" "encoding". Anyway, whether it's encoding or decoding, I
> > thought it would be easier to transform "7R %H $SSURYHG" to "To Be
> > Approved" than the opposite (or at least I don't know how to do the
> > opposite). I spent quite a long time trying to find the character codes
> > for the glyphs in the currently used font, and found that it's not an
> > easy task. By the way, if you know how to do that, I'd very much
> > appreciate it, because I need it to replace text with other text, and for
> > that the new text must be encoded the same way as the original!
> >
> > Back to the text removal: I am able to find the text and remove it by
> > calling reset. As I mentioned in my first email, when I print the output
> > content I no longer find the text, but I still see it when I open the
> > file. My first assumption was that there must be some other way to remove
> > the text than the one I am using, and that's what you've actually
> > confirmed in your reply, so could you please tell me what is still missing?
> >
>
> Could you upload the PDF with the reset text too?
>
> BR
> Maruan
>
>
> > Thanks and regards,
> > a7mad
> >
> > On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <[email protected]>
> > wrote:
> >
> >> Hi,
> >>
> >>> On 24.03.2015 at 08:14, a7med shre3y <[email protected]> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Here's how I do it:
> >>>
> >>> 1. I use the following method to encode the text:
> >>>
> >>> String encode(String text, PDFont font) throws Exception {
> >>>     StringBuilder builder = new StringBuilder();
> >>>     byte[] stringBytes = text.getBytes();
> >>>     int codeLength = 1;
> >>>     for (int i = 0; i < stringBytes.length; i += codeLength) {
> >>>         String c = font.encode(stringBytes, i, codeLength);
> >>>         if (c == null && (i + 1 < stringBytes.length)) {
> >>>             codeLength++;
> >>>             c = font.encode(stringBytes, i, codeLength);
> >>>         }
> >>>         builder.append(c);
> >>>     }
> >>>     return builder.toString();
> >>> }
> >>>
> >>> 2. Iterating through the tokens, I find the text, whether it's a COSString
> >>> ("Tj" operator) or a COSArray ("TJ" operator), then check if it's the text
> >>> I want to remove, as follows:
> >>>
> >>> if (op.getOperation().equals("Tj")) {
> >>>     COSString previous = (COSString) tokens.get(j - 1);
> >>>     String string = previous.getString();
> >>>     String encodedString = encode(string, font);
> >>
> >> that string is already encoded. So you'd need to encode "To Be Approved"
> >> and compare if that matches the string you are reading from the PDF.
> >>
> >>>     if (encodedString.contains("To Be Approved")) {
> >>>         previous.reset();
> >>>     }
> >>> } else if (op.getOperation().equals("TJ")) {
> >>>     COSArray previous = (COSArray) tokens.get(j - 1);
> >>>     StringBuilder stringBuilder = new StringBuilder();
> >>>     for (int k = 0; k < previous.size(); k++) {
> >>>         Object arrElement = previous.getObject(k);
> >>>         if (arrElement instanceof COSString) {
> >>>             COSString cosString = (COSString) arrElement;
> >>>             stringBuilder.append(cosString.getString());
> >>>         }
> >>>     }
> >>>     String string = stringBuilder.toString();
> >>>     String encodedString = encode(string, font);
> >>>     if (encodedString.contains("To Be Approved")) {
> >>>         previous.clear();
> >>>     }
> >>> }
> >>>
> >>> Note:
> >>> In case of COSArray, I first iterate through the whole array to get the
> >>> whole string before encoding and comparison and this works.
> >>>
> >>> Best Regards,
> >>> a7mad
> >>>
> >>>
> >>>
> >>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <[email protected]>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> your text is encoded so within the show text operator Tj the string is
> >>>>
> >>>> 7R %H $SSURYHG
> >>>>
> >>>> You wrote that you encode your string to find it - what do you get?
> >>>>
> >>>> BR
> >>>> Maruan
> >>>>
> >>>>
> >>>>
> >>>>> On 23.03.2015 at 22:01, a7med shre3y <[email protected]> wrote:
> >>>>>
> >>>>> Hi Maruan,
> >>>>>
> >>>>> Here's a link from where you can download the PDF.
> >>>>>
> >>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >>>>>
> >>>>> Kind Regards,
> >>>>> a7mad
> >>>>>
> >>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> you need to upload it to a public location as the mailing list
> >>>>>> doesn't support attachments.
> >>>>>>
> >>>>>> BR
> >>>>>> Maruan
> >>>>>>
> >>>>>>> On 23.03.2015 at 19:18, a7med shre3y <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Dear Maruan,
> >>>>>>>
> >>>>>>> Thank you very much for the information. Please find attached the PDF
> >>>>>>> to reproduce the problem.
> >>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
> >>>>>>> encoding, so I first encode it in order to find it, then remove it.
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>> a7mad
> >>>>>>>
> >>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>> Dear a7mad,
> >>>>>>>>
> >>>>>>>> removing text from a PDF is not an easy task as
> >>>>>>>> - text which visually appears as a single item might consist of
> >>>>>>>>   individual parts within the PDF itself, e.g. each character or
> >>>>>>>>   group of characters is placed individually in a different COSString
> >>>>>>>> - text might be drawn using graphics commands
> >>>>>>>> - text can appear within different parts of the PDF (e.g. the text
> >>>>>>>>   might be the content of a form field AND of the annotation
> >>>>>>>>   representing the form field visually)
> >>>>>>>> - you need to look up the encoding information to get from the
> >>>>>>>>   characters in the PDF "string" to the ones you are looking for
> >>>>>>>> ….
> >>>>>>>>
> >>>>>>>> If you can post a specific PDF to a public location and describe in
> >>>>>>>> detail which string should have been replaced but wasn't, I will be
> >>>>>>>> able to tell you why that might have happened.
> >>>>>>>>
> >>>>>>>> Maruan
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi all,
> >>>>>>>>>
> >>>>>>>>> Currently I am facing a strange problem removing text from some PDFs.
> >>>>>>>>> My program is able to find the text and "remove it" by calling the
> >>>>>>>>> COSString.reset() method.
> >>>>>>>>> The problem is that when I open the output PDF file, I still see the
> >>>>>>>>> text, but it is not selectable (when I try to highlight it with the
> >>>>>>>>> mouse to copy it, nothing is selected). When I print the content
> >>>>>>>>> (tokens) of the output file, I DO NOT find the text at all!!
> >>>>>>>>>
> >>>>>>>>> I am currently stuck in the PDF 1.5 specification and really running
> >>>>>>>>> out of time.
> >>>>>>>>>
> >>>>>>>>> I'd so much appreciate any help or any idea on what's going on.
> >>>>>>>>>
> >>>>>>>>> Notes:
> >>>>>>>>> 1. I use PDFBox 1.7.1
> >>>>>>>>> 2. This problem does not occur with all PDFs; only some PDFs cause
> >>>>>>>>> this problem.
> >>>>>>>>>
> >>>>>>>>> Thank you very much.
> >>>>>>>>> a7mad
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >> ---------------------------------------------------------------------
> >>>>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>>>> For additional commands, e-mail: [email protected]
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>
> >>
> >>
> >>
>
>
>
>
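
The advice quoted above (encode "To Be Approved" and compare it with the raw
string read from the PDF, rather than re-encoding the raw string) can be
illustrated with a minimal, self-contained sketch. It deliberately does not use
PDFBox. It assumes, based only on the two strings quoted in this thread, that
every non-space character code is 29 lower than its readable counterpart and
that spaces appear unchanged. That offset is an observation about this one
sample, not a general rule; in a real document the mapping has to come from the
font's encoding information. The class and method names are hypothetical.

```java
final class EncodedTextMatch {

    // ASSUMPTION: derived from the sample strings in this thread only
    // ('T' is 84 and '7' is 55, 'o' is 111 and 'R' is 82, and so on).
    static final int OBSERVED_OFFSET = 29;

    /** Maps readable text to the codes as they appear in the content stream. */
    static String encode(String readable) {
        StringBuilder out = new StringBuilder(readable.length());
        for (int i = 0; i < readable.length(); i++) {
            char c = readable.charAt(i);
            // Spaces are left as quoted in the thread (an assumption).
            out.append(c == ' ' ? c : (char) (c - OBSERVED_OFFSET));
        }
        return out.toString();
    }

    /** Maps codes from the content stream back to readable text. */
    static String decode(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            out.append(c == ' ' ? c : (char) (c + OBSERVED_OFFSET));
        }
        return out.toString();
    }

    /** Encodes the search text once, then compares in the encoded domain. */
    static boolean rawContains(String rawFromPdf, String readableNeedle) {
        return rawFromPdf.contains(encode(readableNeedle));
    }

    public static void main(String[] args) {
        String raw = "7R %H $SSURYHG"; // string as read from the Tj operand
        System.out.println(encode("To Be Approved"));           // 7R %H $SSURYHG
        System.out.println(decode(raw));                        // To Be Approved
        System.out.println(rawContains(raw, "To Be Approved")); // true
    }
}
```

Comparing in the encoded domain also produces the encoded form that a
replacement string would need, which is the other thing asked about in the
thread.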
