Re: Text removal

Maruan Sahyoun Tue, 24 Mar 2015 01:49:46 -0700


> On 24.03.2015 at 09:40, a7med shre3y <[email protected]> wrote:
> 
> Hi,
> 
> In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG" into
> "To Be Approved" "encoding". Whether it is encoding or decoding, I thought
> it would be easier to transform "7R %H $SSURYHG" into "To Be Approved" than
> the reverse (or at least I don't know how to do the reverse). I spent quite
> a long time trying to find the character codes for the glyphs in the
> currently used font, and found that it is not an easy task. By the way,
> if you know how to do that, I'd very much appreciate it, because I need it
> to replace text with other text, and for that the new text must be
> encoded the same way as the original!
> 
> Back to the text removal: I am able to find the text and remove it by
> calling reset. As I mentioned in my first email, when I print the output
> content I no longer find the text, but I still see it when I open the
> file. My first assumption was that there must be some other way to remove
> the text than the one I am using, and that is what you actually confirmed
> in your reply, so could you please tell me what is still missing?
>


Could you upload the PDF with the reset text too?
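
On the question of finding the character codes: the one pair of strings already seen in this thread is enough to sketch the direction of the comparison. The plain-Java sketch below makes no PDFBox calls, and the class name ObservedCodeTable is made up for illustration. It learns a character-to-code table from the sample pair "To Be Approved" / "7R %H $SSURYHG", encodes the search text with it, and only then compares against the raw string from the content stream. It assumes one code per character and a table that covers every character of the search text. In a real file the mapping belongs to the font (its encoding or ToUnicode CMap) and would have to be read from there.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Illustration only: a lookup table learned from one known (visible text, raw string) pair. */
final class ObservedCodeTable {
    private final Map<Character, Character> toCode = new HashMap<>();

    /** Pairs up the characters of two equally long strings, position by position. */
    ObservedCodeTable(String visibleText, String rawContentString) {
        if (visibleText.length() != rawContentString.length()) {
            throw new IllegalArgumentException("sample strings must have the same length");
        }
        for (int i = 0; i < visibleText.length(); i++) {
            char plain = visibleText.charAt(i);
            char code = rawContentString.charAt(i);
            Character previous = toCode.putIfAbsent(plain, code);
            if (previous != null && previous != code) {
                throw new IllegalArgumentException("inconsistent sample for '" + plain + "'");
            }
        }
    }

    /** Translates search text into content-stream form, or empty if a character is unknown. */
    Optional<String> encode(String searchText) {
        StringBuilder out = new StringBuilder(searchText.length());
        for (char plain : searchText.toCharArray()) {
            Character code = toCode.get(plain);
            if (code == null) {
                return Optional.empty();
            }
            out.append(code.charValue());
        }
        return Optional.of(out.toString());
    }

    public static void main(String[] args) {
        ObservedCodeTable table = new ObservedCodeTable("To Be Approved", "7R %H $SSURYHG");
        String rawFromPdf = "7R %H $SSURYHG";
        // Encode the search text first, then compare it with the raw string.
        boolean found = table.encode("Approved").map(rawFromPdf::contains).orElse(false);
        System.out.println(found);
    }
}
```

The letters in this sample all differ from their codes by the same amount (29), but that looks like an accident of this particular font and should not be relied on.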

BR
Maruan


> Thanks and regards,
> a7mad
> 
> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <[email protected]>
> wrote:
> 
>> Hi,
>> 
>>> On 24.03.2015 at 08:14, a7med shre3y <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> Here's how I do it:
>>> 
>>> 1. I use the following method to encode the text:
>>> 
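>>> String encode(String text, PDFont font) throws Exception {
>>>       StringBuilder builder = new StringBuilder();
>>>       byte[] stringBytes = text.getBytes();
>>>       int codeLength = 1;
>>>       for(int i = 0; i < stringBytes.length; i += codeLength){
>>>               // Reset for each character; otherwise one two-byte code
>>>               // makes every later read two bytes long.
>>>               codeLength = 1;
>>>               String c = font.encode(stringBytes, i, codeLength);
>>>               if(c == null && (i + 1 < stringBytes.length)){
>>>                   // No one-byte mapping: retry as a two-byte code.
>>>                   codeLength++;
>>>                   c = font.encode(stringBytes, i, codeLength);
>>>               }
>>>               // Skip unmapped codes instead of appending the text "null".
>>>               if(c != null){
>>>                   builder.append(c);
>>>               }
>>>           }
>>>       return builder.toString();
>>>   }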
>>> 
>>> 2. Iterating through the tokens, I find the text, whether it is a COSString
>>> ("Tj" operator) or a COSArray ("TJ" operator), then check whether it is the
>>> text I want to remove, as follows:
>>> 
>>> if (op.getOperation().equals("Tj")) {
>>>                           COSString previous = (COSString) tokens.get(j - 1);
>>>                           String string = previous.getString();
>>>                           String encodedString = encode(string, font);
>> 
>> That string is already encoded, so you'd need to encode "To Be Approved"
>> and check whether it matches the string you are reading from the PDF.
>> 
>>>                           if(encodedString.contains("To Be Approved")){
>>>                               previous.reset();
>>>                           }
>>>                       } else if (op.getOperation().equals("TJ")) {
>>>                           COSArray previous = (COSArray) tokens.get(j - 1);
>>>                           StringBuilder stringBuilder = new StringBuilder();
>>>                           for (int k = 0; k < previous.size(); k++) {
>>>                               Object arrElement = previous.getObject(k);
>>>                               if (arrElement instanceof COSString) {
>>>                                   COSString cosString = (COSString) arrElement;
>>>                                   stringBuilder.append(cosString.getString());
>>>                               }
>>>                           }
>>>                           String string = stringBuilder.toString();
>>>                           String encodedString = encode(string, font);
>>>                           if(encodedString.contains("To Be Approved")){
>>>                               previous.clear();
>>>                           }
>>>                       }
>>> 
>>> Note:
>>> In the COSArray case, I first iterate through the whole array to collect
>>> the complete string before encoding and comparing, and this works.
>>> 
>>> Best Regards,
>>> a7mad
>>> 
>>> 
>>> 
>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <[email protected]>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> your text is encoded, so within the show-text operator Tj the string is
>>>> 
>>>> 7R %H $SSURYHG
>>>> 
>>>> You wrote that you encode your string to find it - what do you get?
>>>> 
>>>> BR
>>>> Maruan
>>>> 
>>>> 
>>>> 
>>>>> On 23.03.2015 at 22:01, a7med shre3y <[email protected]> wrote:
>>>>> 
>>>>> Hi Maruan,
>>>>> 
>>>>> Here's a link where you can download the PDF:
>>>>> 
>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
>>>>> 
>>>>> Kind Regards,
>>>>> a7mad
>>>>> 
>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> you need to upload it to a public location as the mailing list doesn't
>>>>>> support attachments.
>>>>>> 
>>>>>> BR
>>>>>> Maruan
>>>>>> 
>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <[email protected]> wrote:
>>>>>>> 
>>>>>>> Dear Maruan,
>>>>>>> 
>>>>>>> Thank you very much for the information. Please find attached the PDF
>>>>>>> to reproduce the problem.
>>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
>>>>>>> encoding, so I first call encode on it in order to find it, then
>>>>>>> remove it.
>>>>>>> 
>>>>>>> Best Regards,
>>>>>>> a7mad
>>>>>>> 
>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <[email protected]> wrote:
>>>>>>>> Dear a7mad,
>>>>>>>> 
>>>>>>>> removing text from a PDF is not an easy task, as
>>>>>>>> - text which visually appears as a single item might consist of
>>>>>>>>   individual parts within the PDF itself, e.g. each character or group
>>>>>>>>   of characters is placed individually in a different COSString
>>>>>>>> - text might be drawn using graphics commands
>>>>>>>> - text can appear within different parts of the PDF (e.g. the text
>>>>>>>>   might be the content of a form field AND of the annotation
>>>>>>>>   representing the form field visually)
>>>>>>>> - you need to look up the encoding information to get from the
>>>>>>>>   characters in the PDF "string" to the ones you are looking for
>>>>>>>> ….
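
To illustrate the first point in the list, here is a rough plain-Java model of a TJ operand array, where strings alternate with numeric position adjustments. The class SplitTextDemo and its methods are hypothetical and use no PDFBox types. The fragment boundaries and the numbers are invented, and the strings are assumed to be in the same encoded form as in the content stream, so the search term has to be encoded too.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in for a TJ operand array: strings mixed with numeric adjustments. */
final class SplitTextDemo {
    /** Joins only the string elements, skipping the numeric adjustments between them. */
    static String joinStrings(List<Object> tjOperands) {
        StringBuilder joined = new StringBuilder();
        for (Object element : tjOperands) {
            if (element instanceof String) {
                joined.append((String) element);
            }
        }
        return joined.toString();
    }

    /** Empties the array if its joined text contains the (already encoded) search term. */
    static boolean clearIfContains(List<Object> tjOperands, String encodedSearch) {
        if (joinStrings(tjOperands).contains(encodedSearch)) {
            tjOperands.clear();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // One visible phrase stored as three fragments with two adjustments in between.
        List<Object> operands = new ArrayList<>(
                Arrays.<Object>asList("7R ", -12, "%H ", 3.5, "$SSURYHG"));
        System.out.println(joinStrings(operands));
        // The search term spans two fragments, so no single string would match it.
        System.out.println(clearIfContains(operands, "%H $SS"));
        System.out.println(operands.isEmpty());
    }
}
```

Searching each fragment on its own would miss a term that crosses a fragment boundary, which is why the strings are joined before the comparison.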
>>>>>>>> 
>>>>>>>> If you can post a specific PDF to a public location and describe in
>>>>>>>> detail which string should have been replaced but wasn't, I will be
>>>>>>>> able to tell you why that might have happened.
>>>>>>>> 
>>>>>>>> Maruan
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> Currently I am facing a strange problem removing text from some PDFs.
>>>>>>>>> My program is able to find the text and "remove it" by calling the
>>>>>>>>> COSString.reset() method.
>>>>>>>>> The problem is that when I open the output PDF file, I still see the
>>>>>>>>> text, but it is not selectable (I mean, when I try to highlight it
>>>>>>>>> with the mouse to copy it, it's not selectable!). When I print the
>>>>>>>>> content (tokens) of the output file, I DO NOT find the text at all!!
>>>>>>>>> 
>>>>>>>>> I am currently stuck in the PDF specification 1.5 and really running
>>>>>>>>> out of time.
>>>>>>>>> 
>>>>>>>>> I'd so much appreciate any help or any idea on what's going on.
>>>>>>>>> 
>>>>>>>>> Notes:
>>>>>>>>> 1. I use PDFBox 1.7.1
>>>>>>>>> 2. This problem does not occur with all PDFs; only some cause it.
>>>>>>>>> 
>>>>>>>>> Thank you very much.
>>>>>>>>> a7mad
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>> 
>>>>>>> 
