What are the drawing commands? I'd then investigate how to identify the ones that draw the text.
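For reference, a minimal sketch of how content stream operators could be sorted into groups, assuming the operator names have already been pulled out of the token list (for example via op.getOperation(), as in the code quoted below). The operator names are the path and text operators from the PDF specification; the class name DrawingOperators and the method classify are made up for illustration and are not part of PDFBox. Note that this only tells path drawing apart from text showing. It cannot tell which paths happen to form a glyph, which is the hard part Maruan describes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not a PDFBox class: groups PDF content stream
// operator names by what they do.
class DrawingOperators {
    // Operators that build a path (moveto, lineto, curves, closepath, rectangle).
    static final Set<String> PATH_CONSTRUCTION =
            new HashSet<>(Arrays.asList("m", "l", "c", "v", "y", "h", "re"));

    // Operators that stroke, fill, or end a path.
    static final Set<String> PATH_PAINTING =
            new HashSet<>(Arrays.asList("S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"));

    // Operators that show text strings.
    static final Set<String> TEXT_SHOWING =
            new HashSet<>(Arrays.asList("Tj", "TJ", "'", "\""));

    // Returns a coarse label for one operator name.
    static String classify(String operator) {
        if (PATH_CONSTRUCTION.contains(operator)) {
            return "path construction";
        }
        if (PATH_PAINTING.contains(operator)) {
            return "path painting";
        }
        if (TEXT_SHOWING.contains(operator)) {
            return "text showing";
        }
        return "other";
    }
}
```

Removing every "path construction" and "path painting" operator together with its operands would also remove all other vector graphics on the page, so this is only a starting point for inspecting what the page contains.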

On Tue, Mar 24, 2015 at 10:26 AM, Maruan Sahyoun <[email protected]> wrote:

> > On 24.03.2015 at 10:14, a7med shre3y <[email protected]> wrote:
> >
> > That's true, I've even tried changing the text rendering mode to the other
> > values mentioned in the PDF specs 1.5 table 5.3 before removing it; that
> > also didn't work.
> > So how to remove the graphics content then?
>
> the simple answer - remove the drawing commands.
>
> The longer answer: as you obviously don't want to remove all drawing
> commands, you'd need to find which are the ones drawing the text. As you
> would like to remove certain vectors which match a certain character/glyph,
> you first need to find out which are the ones drawing e.g. the letter 'T'.
> I don't think that this is doable in a reasonable amount of time for
> arbitrary text.
>
> Maruan
>
> > Best Regards,
> >
> > On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun <[email protected]> wrote:
> >
> >> Hi,
> >>
> >>> On 24.03.2015 at 09:55, a7med shre3y <[email protected]> wrote:
> >>>
> >>> You can download it from here:
> >>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
> >>
> >> looking more closely, you correctly replaced the text, but that text was in
> >> there for searching within the PDF as it used text rendering mode 3
> >> (invisible). The 'text' you are still seeing is drawn using vector commands,
> >> so it's graphics content.
> >>
> >> BR
> >> Maruan
> >>
> >>> Best Regards,
> >>>
> >>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <[email protected]> wrote:
> >>>
> >>>>> On 24.03.2015 at 09:40, a7med shre3y <[email protected]> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> In fact PDFBox calls the operation of transforming "7R %H $SSURYHG" to
> >>>>> "To Be Approved" "encoding". Anyway, whether it's encoding or decoding, I
> >>>>> thought it's easier to transform "7R %H $SSURYHG" to "To Be Approved"
> >>>>> and not the opposite (or at least I don't know how). I spent quite a long
> >>>>> time trying to find out how to find the character codes for the glyphs in
> >>>>> the currently used font, then I found that it's not an easy task. By the
> >>>>> way, if you know how to do that, I'd so much appreciate it because I need
> >>>>> that for replacing text with another text, and for that the new text must
> >>>>> be encoded the same way as the original!
> >>>>>
> >>>>> Back to the text removal: I am able to find the text and also remove it
> >>>>> by calling reset, as I mentioned in my first email. When I print the
> >>>>> output content I don't find the text anymore, but I still see it when I
> >>>>> open the file. My first assumption was that there must be some other way
> >>>>> to remove the text other than the way I am using, and that's what you've
> >>>>> actually confirmed in your reply, so could you please tell me what is
> >>>>> still missing?
> >>>>
> >>>> Could you upload the PDF with the reset text too?
> >>>>
> >>>> BR
> >>>> Maruan
> >>>>
> >>>>> Thanks and regards,
> >>>>> a7mad
> >>>>>
> >>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <[email protected]> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>>> On 24.03.2015 at 08:14, a7med shre3y <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Here's how I do it:
> >>>>>>>
> >>>>>>> 1. I use the following method to encode the text:
> >>>>>>>
> >>>>>>> String encode(String text, PDFont font) throws Exception {
> >>>>>>>     StringBuilder builder = new StringBuilder();
> >>>>>>>     byte[] stringBytes = text.getBytes();
> >>>>>>>     int codeLength = 1;
> >>>>>>>     for(int i = 0; i < stringBytes.length; i += codeLength){
> >>>>>>>         String c = font.encode(stringBytes, i, codeLength);
> >>>>>>>         if(c == null && (i + 1 < stringBytes.length)){
> >>>>>>>             codeLength++;
> >>>>>>>             c = font.encode(stringBytes, i, codeLength);
> >>>>>>>         }
> >>>>>>>         builder.append(c);
> >>>>>>>     }
> >>>>>>>     return builder.toString();
> >>>>>>> }
> >>>>>>>
> >>>>>>> 2. Iterating through the tokens, I find the text, whether it's a
> >>>>>>> COSString ("Tj" operator) or a COSArray ("TJ" operator), then check if
> >>>>>>> it's the text I'm looking to remove, as follows:
> >>>>>>>
> >>>>>>> if (op.getOperation().equals("Tj")) {
> >>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
> >>>>>>>     String string = previous.getString();
> >>>>>>>     String encodedString = encode(string, font);
> >>>>>>
> >>>>>> that string is already encoded. So you'd need to encode "To Be Approved"
> >>>>>> and compare if that matches the string you are reading from the PDF.
> >>>>>>
> >>>>>>>     if(encodedString.contains("To Be Approved")){
> >>>>>>>         previous.reset();
> >>>>>>>     }
> >>>>>>> } else if (op.getOperation().equals("TJ")) {
> >>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
> >>>>>>>     StringBuilder stringBuilder = new StringBuilder();
> >>>>>>>     for (int k = 0; k < previous.size(); k++) {
> >>>>>>>         Object arrElement = previous.getObject(k);
> >>>>>>>         if (arrElement instanceof COSString) {
> >>>>>>>             COSString cosString = (COSString) arrElement;
> >>>>>>>             stringBuilder.append(cosString.getString());
> >>>>>>>         }
> >>>>>>>     }
> >>>>>>>     String string = stringBuilder.toString();
> >>>>>>>     String encodedString = encode(string, font);
> >>>>>>>     if(encodedString.contains("To Be Approved")){
> >>>>>>>         previous.clear();
> >>>>>>>     }
> >>>>>>> }
> >>>>>>>
> >>>>>>> Note:
> >>>>>>> In case of a COSArray, I first iterate through the whole array to get
> >>>>>>> the whole string before encoding and comparison, and this works.
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>> a7mad
> >>>>>>>
> >>>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> your text is encoded, so within the show text operator Tj the string is
> >>>>>>>>
> >>>>>>>> 7R %H $SSURYHG
> >>>>>>>>
> >>>>>>>> You wrote that you encode your string to find it - what do you get?
> >>>>>>>>
> >>>>>>>> BR
> >>>>>>>> Maruan
> >>>>>>>>
> >>>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Maruan,
> >>>>>>>>>
> >>>>>>>>> Here's a link from where you can download the PDF.
> >>>>>>>>>
> >>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >>>>>>>>>
> >>>>>>>>> Kind Regards,
> >>>>>>>>> a7mad
> >>>>>>>>>
> >>>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> you need to upload it to a public location as the mailing list
> >>>>>>>>>> doesn't support attachments.
> >>>>>>>>>>
> >>>>>>>>>> BR
> >>>>>>>>>> Maruan
> >>>>>>>>>>
> >>>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Dear Maruan,
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you very much for the information. Please find attached the
> >>>>>>>>>>> PDF to reproduce the problem.
> >>>>>>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
> >>>>>>>>>>> encoding, so I first encode it in order to find it, then remove it.
> >>>>>>>>>>>
> >>>>>>>>>>> Best Regards,
> >>>>>>>>>>> a7mad
> >>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <[email protected]> wrote:
> >>>>>>>>>>>> Dear a7mad,
> >>>>>>>>>>>>
> >>>>>>>>>>>> removing text from a PDF is not an easy task as
> >>>>>>>>>>>> - text which might visually appear as a single item might consist
> >>>>>>>>>>>>   of individual parts within the PDF itself, e.g. each character or
> >>>>>>>>>>>>   groups of characters are placed individually in different
> >>>>>>>>>>>>   COSStrings
> >>>>>>>>>>>> - text might be drawn using graphics commands
> >>>>>>>>>>>> - text can appear within different parts of the PDF (e.g. the text
> >>>>>>>>>>>>   might be content of a form field AND the annotation representing
> >>>>>>>>>>>>   the form field visually)
> >>>>>>>>>>>> - you need to look up the encoding information to get from the
> >>>>>>>>>>>>   characters in the PDF "string" to the ones you are looking for
> >>>>>>>>>>>> ….
> >>>>>>>>>>>>
> >>>>>>>>>>>> If you can post a specific PDF to a public location and describe in
> >>>>>>>>>>>> detail which string should have been replaced but hasn't, I will be
> >>>>>>>>>>>> able to tell you why that might have happened.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Maruan
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Currently I am facing a strange problem removing text from some
> >>>>>>>>>>>>> PDFs. My program is able to find the text and "remove it" by
> >>>>>>>>>>>>> calling the COSString.reset() method.
> >>>>>>>>>>>>> The problem is, when I open the output PDF file, I still see the
> >>>>>>>>>>>>> text but it is not selectable (I mean when I try to highlight it
> >>>>>>>>>>>>> with the mouse to copy it, it's not selectable!). When I print the
> >>>>>>>>>>>>> content (tokens) of the output file, I DO NOT find the text at
> >>>>>>>>>>>>> all!!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I am currently stuck in the PDF specifications 1.5 and really
> >>>>>>>>>>>>> running out of time.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'd so much appreciate any help or any idea on what's going on.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Notes:
> >>>>>>>>>>>>> 1. I use PDFBox 1.7.1
> >>>>>>>>>>>>> 2. This problem does not occur with all PDFs, only some PDFs cause
> >>>>>>>>>>>>>    this problem.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thank you very much.
> >>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>
> >>>>>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>>>>>>>> For additional commands, e-mail: [email protected]
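
On the encoding point in the thread (encode the search text and compare it with the raw string from the PDF, rather than the other way round), here is a rough sketch. It assumes the only thing known is the one pair from this thread, "To Be Approved" and "7R %H $SSURYHG", as they appear in the mails, and it simply learns a character table from that pair. The class SampleEncoding and its methods are hypothetical and not PDFBox API; a real solution would read the font's encoding or ToUnicode information instead, and the actual bytes in the PDF are multi-byte codes rather than the single characters shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a PDFBox class: learns a plain -> encoded
// character table from one known sample and applies it to search text.
class SampleEncoding {
    // Builds the table from a plain string and its encoded form (same length).
    static Map<Character, Character> learn(String plain, String encoded) {
        if (plain.length() != encoded.length()) {
            throw new IllegalArgumentException("samples must have the same length");
        }
        Map<Character, Character> table = new HashMap<>();
        for (int i = 0; i < plain.length(); i++) {
            table.put(plain.charAt(i), encoded.charAt(i));
        }
        return table;
    }

    // Encodes text with the table; returns null if a character was never seen.
    static String encode(String text, Map<Character, Character> table) {
        StringBuilder builder = new StringBuilder();
        for (char ch : text.toCharArray()) {
            Character mapped = table.get(ch);
            if (mapped == null) {
                return null;
            }
            builder.append(mapped);
        }
        return builder.toString();
    }
}
```

With a table learned from the pair above, the encoded form of the search text can be compared directly against the value returned by COSString.getString(), which is the direction Maruan suggests. Characters that do not occur in the sample cannot be encoded this way, which is why the font's own mapping is needed for arbitrary replacement text.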

