> On 24.03.2015 at 09:40, a7med shre3y <[email protected]> wrote:
>
> Hi,
>
> In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG" to "To
> Be Approved" "encoding". Anyway, whether it's encoding or decoding, I
> thought it would be easier to transform "7R %H $SSURYHG" to "To Be Approved"
> and not the opposite (or at least I don't know how). I spent quite a long
> time trying to find out how to get the character codes for the glyphs in the
> currently used font, and found that it's not an easy task. By the way,
> if you know how to do that, I'd very much appreciate it, because I need it
> for replacing text with other text, and for that the new text must be
> encoded the same way as the original!
>
> Back to the text removal: I am able to find the text and also remove it by
> calling reset, as I mentioned in my first email. When I print the output
> content I don't find the text anymore, but I still see it when I open the
> file. My first assumption was that there must be some other way to remove
> the text than the one I am using, and that's what you've actually
> confirmed in your reply, so could you please tell me what is still missing?
>
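A note on the two strings quoted above: comparing them suggests that, in this particular PDF, each code is the ASCII value minus 29 ('T' is 0x54, '7' is 0x37). The sketch below is plain Java with no PDFBox calls. The class and method names are made up for illustration, the constant offset is only an observation about this sample (a real font's mapping comes from its encoding or CMap and need not be a fixed shift), and leaving the space unchanged is an assumption based on how the string appears in this email.

```java
public class CodeOffsetDemo {
    // Offset observed in this thread's sample only: 'T' (0x54) is shown as '7' (0x37).
    static final int OFFSET = 'T' - '7'; // 29

    // Hypothetical helper: map readable text to the codes seen in the content stream.
    // Assumption: spaces are left untouched, as they appear in the quoted string.
    static String toContentCodes(String readable) {
        StringBuilder sb = new StringBuilder();
        for (char c : readable.toCharArray()) {
            sb.append(c == ' ' ? c : (char) (c - OFFSET));
        }
        return sb.toString();
    }

    // Hypothetical helper: the reverse direction, from content-stream codes to readable text.
    static String toReadable(String codes) {
        StringBuilder sb = new StringBuilder();
        for (char c : codes.toCharArray()) {
            sb.append(c == ' ' ? c : (char) (c + OFFSET));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String fromPdf = "7R %H $SSURYHG";                 // string as read from the Tj operand
        String needle = toContentCodes("To Be Approved");  // search text moved into the same code space
        System.out.println(needle);                        // prints 7R %H $SSURYHG
        System.out.println(fromPdf.contains(needle));      // prints true
        System.out.println(toReadable(fromPdf));           // prints To Be Approved
    }
}
```

Either direction works for the comparison, as long as both strings are in the same code space before calling contains.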
Could you upload the PDF with the reset text too?

BR
Maruan

> Thanks and regards,
> a7mad
>
> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <[email protected]>
> wrote:
>
>> Hi,
>>
>>> On 24.03.2015 at 08:14, a7med shre3y <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Here's how I do it:
>>>
>>> 1. I use the following method to encode the text:
>>>
>>> String encode(String text, PDFont font) throws Exception {
>>>     StringBuilder builder = new StringBuilder();
>>>     byte[] stringBytes = text.getBytes();
>>>     int codeLength = 1;
>>>     for(int i = 0; i < stringBytes.length; i += codeLength){
>>>         String c = font.encode(stringBytes, i, codeLength);
>>>         if(c == null && (i + 1 < stringBytes.length)){
>>>             codeLength++;
>>>             c = font.encode(stringBytes, i, codeLength);
>>>         }
>>>         builder.append(c);
>>>     }
>>>     return builder.toString();
>>> }
>>>
>>> 2. Iterating through the tokens, I find the text, whether it's a COSString
>>> ("Tj" operator) or a COSArray ("TJ" operator), then check if it's the text
>>> I'm looking to remove, as follows:
>>>
>>> if (op.getOperation().equals("Tj")) {
>>>     COSString previous = (COSString) tokens.get(j - 1);
>>>     String string = previous.getString();
>>>     String encodedString = encode(string, font);
>>
>> that string is already encoded. So you'd need to encode "To Be Approved"
>> and compare if that matches the string you are reading from the PDF.
>>
>>>     if(encodedString.contains("To Be Approved")){
>>>         previous.reset();
>>>     }
>>> } else if (op.getOperation().equals("TJ")) {
>>>     COSArray previous = (COSArray) tokens.get(j - 1);
>>>     StringBuilder stringBuilder = new StringBuilder();
>>>     for (int k = 0; k < previous.size(); k++) {
>>>         Object arrElement = previous.getObject(k);
>>>         if (arrElement instanceof COSString) {
>>>             COSString cosString = (COSString) arrElement;
>>>             stringBuilder.append(cosString.getString());
>>>         }
>>>     }
>>>     String string = stringBuilder.toString();
>>>     String encodedString = encode(string, font);
>>>     if(encodedString.contains("To Be Approved")){
>>>         previous.clear();
>>>     }
>>> }
>>>
>>> Note:
>>> In case of COSArray, I first iterate through the whole array to get the
>>> whole string before encoding and comparison and this works.
>>>
>>> Best Regards,
>>> a7mad
>>>
>>>
>>>
>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> your text is encoded so within the show text operator Tj the string is
>>>>
>>>> 7R %H $SSURYHG
>>>>
>>>> You wrote that you encode your string to find it - what do you get?
>>>>
>>>> BR
>>>> Maruan
>>>>
>>>>
>>>>
>>>>> On 23.03.2015 at 22:01, a7med shre3y <[email protected]> wrote:
>>>>>
>>>>> Hi Maruan,
>>>>>
>>>>> Here's a link from where you can download the PDF.
>>>>>
>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
>>>>>
>>>>> Kind Regards,
>>>>> a7mad
>>>>>
>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> you need to upload it to a public location as the mailing list doesn't
>>>>>> support attachments.
>>>>>>
>>>>>> BR
>>>>>> Maruan
>>>>>>
>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <[email protected]> wrote:
>>>>>>>
>>>>>>> Dear Maruan,
>>>>>>>
>>>>>>> Thank you very much for the information.
>>>>>>> Please find attached the PDF to reproduce the problem.
>>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
>>>>>>> encoding, so I first encode it in order to find it, then remove it.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> a7mad
>>>>>>>
>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <[email protected]>
>>>>>>>> wrote:
>>>>>>>> Dear a7mad,
>>>>>>>>
>>>>>>>> removing text from a PDF is not an easy task as
>>>>>>>> - text which might visually appear as a single item might consist of
>>>>>>>>   individual parts within the PDF itself, e.g. each character or group
>>>>>>>>   of characters is placed individually in different COSStrings
>>>>>>>> - text might be drawn using graphics commands
>>>>>>>> - text can appear within different parts of the PDF (e.g. the text
>>>>>>>>   might be content of a form field AND the annotation representing the
>>>>>>>>   form field visually)
>>>>>>>> - you need to look up the encoding information to get from the
>>>>>>>>   characters in the PDF "string" to the ones you are looking for
>>>>>>>> ….
>>>>>>>>
>>>>>>>> If you can post a specific PDF to a public location and describe in
>>>>>>>> detail which string should have been replaced which hasn't I will be
>>>>>>>> able to tell you why that might have happened.
>>>>>>>>
>>>>>>>> Maruan
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Currently I am facing a strange problem removing text from some
>>>>>>>>> PDFs.
>>>>>>>>> My program is able to find the text and "remove it" by calling the
>>>>>>>>> COSString.reset() method.
>>>>>>>>> The problem is, when I open the output PDF file, I still see the text,
>>>>>>>>> but it is not selectable (I mean when I try to highlight it with the
>>>>>>>>> mouse to copy it, it's not selectable!).
>>>>>>>>> When I print the content (tokens) of the output
>>>>>>>>> file, I DO NOT find the text at all!!
>>>>>>>>>
>>>>>>>>> I am currently stuck in the PDF specification 1.5 and really running
>>>>>>>>> out of time.
>>>>>>>>>
>>>>>>>>> I'd so much appreciate any help or any idea on what's going on.
>>>>>>>>>
>>>>>>>>> Notes:
>>>>>>>>> 1. I use PDFBox 1.7.1
>>>>>>>>> 2. This problem does not occur with all PDFs, only some PDFs cause
>>>>>>>>> this problem.
>>>>>>>>>
>>>>>>>>> Thank you very much.
>>>>>>>>> a7mad
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>> For additional commands, e-mail: [email protected]

