> On 24.03.2015 at 10:43, a7med shre3y <[email protected]> wrote:
>
> I mean, how do I find them in the PDF while iterating over the tokens? What
> is the operator?
>
> On Tue, Mar 24, 2015 at 10:40 AM, Maruan Sahyoun <[email protected]>
> wrote:
>
>>
>>> On 24.03.2015 at 10:36, a7med shre3y <[email protected]> wrote:
>>>
>>> What are the drawing commands? I'd then investigate how to identify the
>>> text ones.
>>>
>>
>> 738.7469 167.1278 m    MoveTo
>> 733.8743 167.1278 l    LineTo
>>
>>
>>> On Tue, Mar 24, 2015 at 10:26 AM, Maruan Sahyoun <[email protected]>
>>> wrote:
>>>
>>>>
>>>>> On 24.03.2015 at 10:14, a7med shre3y <[email protected]> wrote:
>>>>>
>>>>> That's true. Before removing it, I even tried changing the text rendering
>>>>> mode to the other values listed in the PDF specs 1.5, table 5.3; that
>>>>> didn't work either.
>>>>> So how to remove the graphics content then?
>>>>
>>>> the simple answer - remove the drawing commands.
>>>>
>>>> The longer answer: as you obviously don't want to remove all drawing
>>>> commands, you'd need to find which are the ones drawing the text. As you
>>>> would like to remove certain vectors which match a certain
>>>> character/glyph, you first need to find out which are the ones drawing
>>>> e.g. the letter 'T'. I don't think that this is doable in a reasonable
>>>> amount of time for arbitrary text.
>>>>
>>>> Maruan
>>>>
>>>>
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>> On 24.03.2015 at 09:55, a7med shre3y <[email protected]> wrote:
>>>>>>>
>>>>>>> You can download it from here:
>>>>>>>
>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
>>>>>>>
>>>>>>
>>>>>> looking more closely you correctly replaced the text, but that text was
>>>>>> in there for searching within the PDF as it used text rendering mode 3
>>>>>> (invisible). The 'text' you are still seeing is drawn using vector
>>>>>> commands so it's graphics content.
>>>>>>
>>>>>> BR
>>>>>> Maruan
>>>>>>
>>>>>>
>>>>>>> Best Regards,
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>> On 24.03.2015 at 09:40, a7med shre3y <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> In fact PDFBox calls the operation of transforming "7R %H $SSURYHG"
>>>>>>>>> to "To Be Approved" "encoding". Anyway, whether it's encoding or
>>>>>>>>> decoding, I thought it's easier to transform "7R %H $SSURYHG" to
>>>>>>>>> "To Be Approved" than the opposite (or at least I don't know how).
>>>>>>>>> I spent quite a long time trying to find out how to get the
>>>>>>>>> character codes for the glyphs in the currently used font, then I
>>>>>>>>> found that it's not an easy task. By the way, if you know how to do
>>>>>>>>> that, I'd very much appreciate it, because I need that for replacing
>>>>>>>>> text with other text, and for that the new text must be encoded the
>>>>>>>>> same way as the original!
>>>>>>>>>
>>>>>>>>> Back to the text removal: I am able to find the text and also remove
>>>>>>>>> it by calling reset. As I mentioned in my first email, when I print
>>>>>>>>> the output content I don't find the text anymore, but I still see it
>>>>>>>>> when I open the file. My first assumption was that there must be
>>>>>>>>> some other way to remove the text than the one I am using, and
>>>>>>>>> that's what you've actually confirmed in your reply, so could you
>>>>>>>>> please tell me what is still missing?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Could you upload the PDF with the reset text too?
>>>>>>>>
>>>>>>>> BR
>>>>>>>> Maruan
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks and regards,
>>>>>>>>> a7mad
>>>>>>>>>
>>>>>>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>> On 24.03.2015 at 08:14, a7med shre3y <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Here's how I do it:
>>>>>>>>>>>
>>>>>>>>>>> 1. I use the following method to encode the text:
>>>>>>>>>>>
>>>>>>>>>>> String encode(String text, PDFont font) throws Exception {
>>>>>>>>>>>     StringBuilder builder = new StringBuilder();
>>>>>>>>>>>     byte[] stringBytes = text.getBytes();
>>>>>>>>>>>     int codeLength = 1;
>>>>>>>>>>>     for (int i = 0; i < stringBytes.length; i += codeLength) {
>>>>>>>>>>>         String c = font.encode(stringBytes, i, codeLength);
>>>>>>>>>>>         if (c == null && (i + 1 < stringBytes.length)) {
>>>>>>>>>>>             codeLength++;
>>>>>>>>>>>             c = font.encode(stringBytes, i, codeLength);
>>>>>>>>>>>         }
>>>>>>>>>>>         builder.append(c);
>>>>>>>>>>>     }
>>>>>>>>>>>     return builder.toString();
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> 2. Iterating through the tokens, I find the text, whether it's a
>>>>>>>>>>> COSString ("Tj" operator) or a COSArray ("TJ" operator), then
>>>>>>>>>>> check if it's the text I'm looking to remove, as follows:
>>>>>>>>>>>
>>>>>>>>>>> if (op.getOperation().equals("Tj")) {
>>>>>>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
>>>>>>>>>>>     String string = previous.getString();
>>>>>>>>>>>     String encodedString = encode(string, font);
>>>>>>>>>>
>>>>>>>>>> that string is already encoded. So you'd need to encode "To Be
>>>>>>>>>> Approved" and compare if that matches the string you are reading
>>>>>>>>>> from the PDF.
>>>>>>>>>>
>>>>>>>>>>>     if (encodedString.contains("To Be Approved")) {
>>>>>>>>>>>         previous.reset();
>>>>>>>>>>>     }
>>>>>>>>>>> } else if (op.getOperation().equals("TJ")) {
>>>>>>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
>>>>>>>>>>>     StringBuilder stringBuilder = new StringBuilder();
>>>>>>>>>>>     for (int k = 0; k < previous.size(); k++) {
>>>>>>>>>>>         Object arrElement = previous.getObject(k);
>>>>>>>>>>>         if (arrElement instanceof COSString) {
>>>>>>>>>>>             COSString cosString = (COSString) arrElement;
>>>>>>>>>>>             stringBuilder.append(cosString.getString());
>>>>>>>>>>>         }
>>>>>>>>>>>     }
>>>>>>>>>>>     String string = stringBuilder.toString();
>>>>>>>>>>>     String encodedString = encode(string, font);
>>>>>>>>>>>     if (encodedString.contains("To Be Approved")) {
>>>>>>>>>>>         previous.clear();
>>>>>>>>>>>     }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Note:
>>>>>>>>>>> In the COSArray case, I first iterate through the whole array to
>>>>>>>>>>> get the whole string before encoding and comparison, and this
>>>>>>>>>>> works.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> a7mad
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> your text is encoded so within the show text operator Tj the
>>>>>>>>>>>> string is
>>>>>>>>>>>>
>>>>>>>>>>>> 7R %H $SSURYHG
>>>>>>>>>>>>
>>>>>>>>>>>> You wrote that you encode your string to find it - what do you
>>>>>>>>>>>> get?
>>>>>>>>>>>>
>>>>>>>>>>>> BR
>>>>>>>>>>>> Maruan
>>>>>>>>>>>>
>>>>>>>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Maruan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's a link from where you can download the PDF.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kind Regards,
>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> you need to upload it to a public location as the mailing list
>>>>>>>>>>>>>> doesn't support attachments.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> BR
>>>>>>>>>>>>>> Maruan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dear Maruan,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you very much for the information. Please find attached
>>>>>>>>>>>>>>> the PDF to reproduce the problem.
>>>>>>>>>>>>>>> The text to remove is: "To Be Approved". The text has a
>>>>>>>>>>>>>>> multi-byte encoding, so I first encode it in order to find
>>>>>>>>>>>>>>> it, then remove it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <[email protected]> wrote:
>>>>>>>>>>>>>>>> Dear a7mad,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> removing text from a PDF is not an easy task as
>>>>>>>>>>>>>>>> - text which might visually appear as a single item might
>>>>>>>>>>>>>>>>   consist of individual parts within the PDF itself, e.g.
>>>>>>>>>>>>>>>>   each character or groups of characters are placed
>>>>>>>>>>>>>>>>   individually in different COSStrings
>>>>>>>>>>>>>>>> - text might be drawn using graphics commands
>>>>>>>>>>>>>>>> - text can appear within different parts of the PDF (e.g.
>>>>>>>>>>>>>>>>   the text might be content of a form field AND the
>>>>>>>>>>>>>>>>   annotation representing the form field visually)
>>>>>>>>>>>>>>>> - you need to look up the encoding information to get from
>>>>>>>>>>>>>>>>   the characters in the PDF "string" to the ones you are
>>>>>>>>>>>>>>>>   looking for
>>>>>>>>>>>>>>>> ….
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If you can post a specific PDF to a public location and
>>>>>>>>>>>>>>>> describe in detail which string should have been replaced
>>>>>>>>>>>>>>>> but wasn't, I will be able to tell you why that might have
>>>>>>>>>>>>>>>> happened.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maruan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Currently I am facing a strange problem removing text from
>>>>>>>>>>>>>>>>> some PDFs.
>>>>>>>>>>>>>>>>> My program is able to find the text and "remove it" by
>>>>>>>>>>>>>>>>> calling the COSString.reset() method.
>>>>>>>>>>>>>>>>> The problem is, when I open the output PDF file, I still
>>>>>>>>>>>>>>>>> see the text but it is not selectable (I mean when I try to
>>>>>>>>>>>>>>>>> highlight it with the mouse to copy it, it's not
>>>>>>>>>>>>>>>>> selectable!). When I print the content (tokens) of the
>>>>>>>>>>>>>>>>> output file, I DO NOT find the text at all!!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am currently stuck in the PDF specifications 1.5 and
>>>>>>>>>>>>>>>>> really running out of time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'd so much appreciate any help or any idea on what's going
>>>>>>>>>>>>>>>>> on.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Notes:
>>>>>>>>>>>>>>>>> 1. I use PDFBox 1.7.1
>>>>>>>>>>>>>>>>> 2. This problem does not occur with all PDFs, only some
>>>>>>>>>>>>>>>>>    PDFs cause this problem.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you very much.
>>>>>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>>>>>>>>>> For additional commands, e-mail: [email protected]
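On the last question in the thread (how to spot the drawing commands while iterating over the tokens), here is a rough sketch. The PDF reference lists m, l, c, v, y, h and re as path construction operators, and S, s, f, F, f*, B, B*, b, b* and n as path painting operators. The sketch works on a plain whitespace-split token array, not on PDFBox objects, and the names `PathOperators` and `findPathOperators` are hypothetical. With PDFBox 1.x the same membership test would presumably be applied to the value of `op.getOperation()`, as in the code quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PathOperators {
    // Path construction operators as listed in the PDF reference.
    static final Set<String> CONSTRUCT =
            new HashSet<>(Arrays.asList("m", "l", "c", "v", "y", "h", "re"));
    // Path painting operators as listed in the PDF reference.
    static final Set<String> PAINT =
            new HashSet<>(Arrays.asList("S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"));

    static boolean isPathOperator(String token) {
        return CONSTRUCT.contains(token) || PAINT.contains(token);
    }

    /** Returns the indices of all path operators in an already tokenized fragment. */
    static List<Integer> findPathOperators(String[] tokens) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (isPathOperator(tokens[i])) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Simplified fragment built around the two commands quoted in the thread.
        // A real content stream needs a proper parser, not a whitespace split.
        String[] tokens = "q 738.7469 167.1278 m 733.8743 167.1278 l S Q".split("\\s+");
        for (int index : findPathOperators(tokens)) {
            System.out.println(index + ": " + tokens[index]);
        }
    }
}
```

This only locates the operators. Deciding which paths draw a particular glyph is the hard part Maruan describes, and the sketch does not attempt it.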

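Maruan's advice to encode "To Be Approved" and compare it with the string read from the PDF can be illustrated as below. This is a minimal sketch under loud assumptions: the code-to-character table is read off the two strings quoted in the thread ("7R %H $SSURYHG" and "To Be Approved"), every code is assumed to be a single character, and the space is assumed to map to itself as the mail displays it. In a real document the table would have to come from the font's encoding information, and the names `SearchStringEncoder`, `sampleTable`, `invert` and `encode` are hypothetical, not PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;

public class SearchStringEncoder {
    /** Builds a code -> character table from the sample strings in the thread (assumption). */
    static Map<Character, Character> sampleTable() {
        String codes = "7R %H $SSURYHG";
        String text = "To Be Approved";
        Map<Character, Character> table = new HashMap<>();
        for (int i = 0; i < codes.length(); i++) {
            table.put(codes.charAt(i), text.charAt(i));
        }
        return table;
    }

    /** Inverts a code -> character table into character -> code. */
    static Map<Character, Character> invert(Map<Character, Character> codeToChar) {
        Map<Character, Character> inverted = new HashMap<>();
        for (Map.Entry<Character, Character> e : codeToChar.entrySet()) {
            inverted.put(e.getValue(), e.getKey());
        }
        return inverted;
    }

    /** Encodes text with the inverted table; returns null if a character has no code. */
    static String encode(String text, Map<Character, Character> charToCode) {
        StringBuilder sb = new StringBuilder();
        for (char ch : text.toCharArray()) {
            Character code = charToCode.get(ch);
            if (code == null) {
                return null; // the sample table does not cover this character
            }
            sb.append(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Character, Character> charToCode = invert(sampleTable());
        String needle = encode("To Be Approved", charToCode);
        String fromPdf = "7R %H $SSURYHG"; // string as quoted from the Tj operand
        System.out.println(needle != null && fromPdf.contains(needle));
    }
}
```

The comparison then happens in the encoded domain, so the string taken from the content stream is used as read and only the search text is transformed.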
