Hi,

In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG" into "To Be Approved" "encoding". Whether it is called encoding or decoding, I thought it was easier to transform "7R %H $SSURYHG" into "To Be Approved" than to do the opposite (or at least I don't know how to do the opposite). I spent quite a long time trying to work out how to find the character codes for the glyphs in the font currently in use, and found that it is not an easy task. By the way, if you know how to do that, I'd very much appreciate it, because I need it for replacing text with other text, and for that the new text must be encoded the same way as the original!
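For this particular string, comparing the two forms character by character shows a constant shift: each non-space character in "7R %H $SSURYHG" is the corresponding character of "To Be Approved" with its code reduced by 29 (for example 'T' is 84 and '7' is 55). The following is only a minimal sketch of that observation, not a general solution: the class name, the method names and the handling of spaces are assumptions made for illustration, and another font or another PDF would most likely need the font's own mapping instead of a fixed offset.

```java
public class ShiftedCodeSketch {
    // Offset observed in this one sample only: 'T' (84) shows up as '7' (55).
    static final int OBSERVED_OFFSET = 29;

    // Turns readable text into the form printed from the content stream.
    static String toPdfString(String text) {
        StringBuilder out = new StringBuilder();
        for (char ch : text.toCharArray()) {
            // Assumption: spaces are left untouched, as they appear in the printed sample.
            out.append(ch == ' ' ? ch : (char) (ch - OBSERVED_OFFSET));
        }
        return out.toString();
    }

    // Reverse direction: from the content-stream form back to readable text.
    static String fromPdfString(String raw) {
        StringBuilder out = new StringBuilder();
        for (char ch : raw.toCharArray()) {
            out.append(ch == ' ' ? ch : (char) (ch + OBSERVED_OFFSET));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPdfString("To Be Approved"));   // prints 7R %H $SSURYHG
        System.out.println(fromPdfString("7R %H $SSURYHG")); // prints To Be Approved
    }
}
```

With something like this, the search string could be shifted once and compared directly against the raw string read from the Tj/TJ operands, but only for as long as the fixed offset actually holds for the font in question.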
Back to the text removal: I am able to find the text and remove it by calling reset(). As I mentioned in my first email, when I print the content of the output file I no longer find the text, but I still see it when I open the file. My first assumption was that there must be some other way to remove the text than the one I am using, and that is what you have actually confirmed in your reply, so could you please tell me what is still missing?

Thanks and regards,
a7mad

On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <[email protected]> wrote:

> Hi,
>
> > On 24.03.2015 at 08:14, a7med shre3y <[email protected]> wrote:
> >
> > Hi,
> >
> > Here's how I do it:
> >
> > 1. I use the following method to encode the text:
> >
> > String encode(String text, PDFont font) throws Exception {
> >     StringBuilder builder = new StringBuilder();
> >     byte[] stringBytes = text.getBytes();
> >     int codeLength = 1;
> >     for(int i = 0; i < stringBytes.length; i += codeLength){
> >         String c = font.encode(stringBytes, i, codeLength);
> >         if(c == null && (i + 1 < stringBytes.length)){
> >             codeLength++;
> >             c = font.encode(stringBytes, i, codeLength);
> >         }
> >         builder.append(c);
> >     }
> >     return builder.toString();
> > }
> >
> > 2. Iterating through the tokens, I find the text, whether it's a COSString
> > ("Tj" operator) or a COSArray ("TJ" operator), then check if it's the text
> > I'm looking to remove, as follows:
> >
> > if (op.getOperation().equals("Tj")) {
> >     COSString previous = (COSString) tokens.get(j - 1);
> >     String string = previous.getString();
> >     String encodedString = encode(string, font);
>
> that string is already encoded. So you'd need to encode "To Be Approved"
> and compare if that matches the string you are reading from the PDF.
>
> >     if(encodedString.contains("To Be Approved")){
> >         previous.reset();
> >     }
> > } else if (op.getOperation().equals("TJ")) {
> >     COSArray previous = (COSArray) tokens.get(j - 1);
> >     StringBuilder stringBuilder = new StringBuilder();
> >     for (int k = 0; k < previous.size(); k++) {
> >         Object arrElement = previous.getObject(k);
> >         if (arrElement instanceof COSString) {
> >             COSString cosString = (COSString) arrElement;
> >             stringBuilder.append(cosString.getString());
> >         }
> >     }
> >     String string = stringBuilder.toString();
> >     String encodedString = encode(string, font);
> >     if(encodedString.contains("To Be Approved")){
> >         previous.clear();
> >     }
> > }
> >
> > Note:
> > In the case of a COSArray, I first iterate through the whole array to get
> > the whole string before encoding and comparison, and this works.
> >
> > Best Regards,
> > a7mad
> >
> > On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <[email protected]>
> > wrote:
> >
> >> Hi,
> >>
> >> your text is encoded, so within the show text operator Tj the string is
> >>
> >> 7R %H $SSURYHG
> >>
> >> You wrote that you encode your string to find it - what do you get?
> >>
> >> BR
> >> Maruan
> >>
> >>> On 23.03.2015 at 22:01, a7med shre3y <[email protected]> wrote:
> >>>
> >>> Hi Maruan,
> >>>
> >>> Here's a link from where you can download the PDF.
> >>>
> >>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >>>
> >>> Kind Regards,
> >>> a7mad
> >>>
> >>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <[email protected]>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> you need to upload it to a public location as the mailing list doesn't
> >>>> support attachments.
> >>>>
> >>>> BR
> >>>> Maruan
> >>>>
> >>>>> On 23.03.2015 at 19:18, a7med shre3y <[email protected]> wrote:
> >>>>>
> >>>>> Dear Maruan,
> >>>>>
> >>>>> Thank you very much for the information. Please find attached the PDF
> >>>>> to reproduce the problem.
> >>>>> The text to remove is: "To Be Approved". The text has a multi-byte
> >>>>> encoding, so I first encode it in order to find it, then remove it.
> >>>>>
> >>>>> Best Regards,
> >>>>> a7mad
> >>>>>
> >>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <[email protected]>
> >>>>>> wrote:
> >>>>>> Dear a7mad,
> >>>>>>
> >>>>>> removing text from a PDF is not an easy task as
> >>>>>> - text which might visually appear as a single item might consist of
> >>>>>>   individual parts within the PDF itself, e.g. each character or
> >>>>>>   groups of characters are placed individually in different COSStrings
> >>>>>> - text might be drawn using graphics commands
> >>>>>> - text can appear within different parts of the PDF (e.g. the text
> >>>>>>   might be content of a form field AND the annotation representing
> >>>>>>   the form field visually)
> >>>>>> - you need to look up the encoding information to get from the
> >>>>>>   characters in the PDF "string" to the ones you are looking for
> >>>>>> ….
> >>>>>>
> >>>>>> If you can post a specific PDF to a public location and describe in
> >>>>>> detail which string should have been replaced which hasn't, I will be
> >>>>>> able to tell you why that might have happened.
> >>>>>>
> >>>>>> Maruan
> >>>>>>
> >>>>>>> On 23.03.2015 at 15:03, a7med shre3y <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> Currently I am facing a strange problem removing text from some
> >>>>>>> PDFs.
> >>>>>>> My program is able to find the text and "remove it" by calling the
> >>>>>>> COSString.reset() method.
> >>>>>>> The problem is, when I open the output PDF file, I still see the
> >>>>>>> text, but it is not selectable (I mean when I try to highlight it
> >>>>>>> with the mouse to copy it, it's not selectable!). When I print the
> >>>>>>> content (tokens) of the output file, I DO NOT find the text at all!!
> >>>>>>>
> >>>>>>> I am currently stuck in the PDF specification 1.5 and really running
> >>>>>>> out of time.
> >>>>>>>
> >>>>>>> I'd so much appreciate any help or any idea on what's going on.
> >>>>>>>
> >>>>>>> Notes:
> >>>>>>> 1. I use PDFBox 1.7.1
> >>>>>>> 2. This problem does not occur with all PDFs; only some PDFs cause
> >>>>>>>    this problem.
> >>>>>>>
> >>>>>>> Thank you very much.
> >>>>>>> a7mad
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>> For additional commands, e-mail: [email protected]

