Hi,
> On 24.03.2015 at 08:14, a7med shre3y <[email protected]> wrote:
>
> Hi,
>
> Here's how I do it:
>
> 1. I use the following method to encode the text:
>
> String encode(String text, PDFont font) throws Exception {
>     StringBuilder builder = new StringBuilder();
>     byte[] stringBytes = text.getBytes();
>     int codeLength = 1;
>     for (int i = 0; i < stringBytes.length; i += codeLength) {
>         String c = font.encode(stringBytes, i, codeLength);
>         if (c == null && (i + 1 < stringBytes.length)) {
>             codeLength++;
>             c = font.encode(stringBytes, i, codeLength);
>         }
>         builder.append(c);
>     }
>     return builder.toString();
> }
>
> 2. Iterating through the tokens, I look for the text operand, which is either
> a COSString ("Tj" operator) or a COSArray ("TJ" operator), and then check
> whether it is the text I want to remove, as follows:
>
> if (op.getOperation().equals("Tj")) {
>     COSString previous = (COSString) tokens.get(j - 1);
>     String string = previous.getString();
>     String encodedString = encode(string, font);
That string is already encoded. So you'd need to encode "To Be Approved" and
check whether the result matches the string you are reading from the PDF.
>     if (encodedString.contains("To Be Approved")) {
>         previous.reset();
>     }
> } else if (op.getOperation().equals("TJ")) {
>     COSArray previous = (COSArray) tokens.get(j - 1);
>     StringBuilder stringBuilder = new StringBuilder();
>     for (int k = 0; k < previous.size(); k++) {
>         Object arrElement = previous.getObject(k);
>         if (arrElement instanceof COSString) {
>             COSString cosString = (COSString) arrElement;
>             stringBuilder.append(cosString.getString());
>         }
>     }
>     String string = stringBuilder.toString();
>     String encodedString = encode(string, font);
>     if (encodedString.contains("To Be Approved")) {
>         previous.clear();
>     }
> }
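A minimal sketch of that comparison, in plain Java without PDFBox so it does
not depend on a particular library version. CODE_OFFSET is a hypothetical
stand-in for the font's real character-to-code mapping (the assumption here is
two-byte codes equal to the Unicode value minus 29, which is what the sample
string in this thread suggests); encodePhrase and indexOfCodes are illustrative
names, not PDFBox API. The idea is to turn the search phrase into the font's
codes once and then look for those bytes in the raw bytes of the string operand.

```java
import java.io.ByteArrayOutputStream;

final class EncodedPhraseSearch {

    // Assumption: this font maps a character to the two-byte code (unicode - 29).
    // A real font needs its own lookup, e.g. derived from its encoding tables.
    static final int CODE_OFFSET = 29;

    /** Encodes the phrase into two-byte codes, high byte first. */
    static byte[] encodePhrase(String phrase) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < phrase.length(); i++) {
            int code = phrase.charAt(i) - CODE_OFFSET;
            out.write((code >> 8) & 0xFF);
            out.write(code & 0xFF);
        }
        return out.toByteArray();
    }

    /** Returns the byte index of the first code-aligned match, or -1. */
    static int indexOfCodes(byte[] raw, byte[] needle) {
        // Step by 2 so a match can only start on a code boundary.
        for (int start = 0; start + needle.length <= raw.length; start += 2) {
            boolean match = true;
            for (int k = 0; k < needle.length && match; k++) {
                match = raw[start + k] == needle[k];
            }
            if (match) {
                return start;
            }
        }
        return -1;
    }
}
```

With PDFBox, the raw bytes would presumably come from the COSString itself
rather than from a decoded Java String, so that no charset conversion gets in
the way of the comparison.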
>
> Note:
> In the COSArray case, I first iterate through the whole array to build the
> complete string before encoding and comparing it, and this works.
>
> Best Regards,
> a7mad
>
>
>
> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <[email protected]>
> wrote:
>
>> Hi,
>>
>> your text is encoded, so the operand of the show-text operator Tj is the string
>>
>> 7R %H $SSURYHG
>>
>> You wrote that you encode your string to find it - what do you get?
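Comparing the two strings character by character suggests a pattern, at least
for this file: every visible character in the Tj operand is the expected
character shifted down by 29. The sketch below only illustrates that
observation; shift is a hypothetical helper, and the constant 29 is inferred
from this one sample, not a property of PDF fonts in general.

```java
final class ShiftObservation {

    /** Returns a copy of s with every char value moved by delta. */
    static String shift(String s, int delta) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append((char) (s.charAt(i) + delta));
        }
        return sb.toString();
    }
}
```

For example, shift("Approved", -29) gives "$SSURYHG", and shift("$SSURYHG", 29)
gives "Approved" again. Under the same shift a space (32) would become code 3,
a control character.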
>>
>> BR
>> Maruan
>>
>>
>>
>>> On 23.03.2015 at 22:01, a7med shre3y <[email protected]> wrote:
>>>
>>> Hi Maruan,
>>>
>>> Here's a link from where you can download the PDF.
>>>
>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
>>>
>>> Kind Regards,
>>> a7mad
>>>
>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> you need to upload it to a public location as the mailing list doesn't
>>>> support attachments.
>>>>
>>>> BR
>>>> Maruan
>>>>
>>>>> On 23.03.2015 at 19:18, a7med shre3y <[email protected]> wrote:
>>>>>
>>>>> Dear Maruan,
>>>>>
>>>>> Thank you very much for the information. Please find attached the PDF to
>>>>> reproduce the problem.
>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
>>>>> encoding, so I first call encode on it in order to find it, and then
>>>>> remove it.
>>>>>
>>>>> Best Regards,
>>>>> a7mad
>>>>>
>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <[email protected]> wrote:
>>>>>> Dear a7mad,
>>>>>>
>>>>>> removing text from a PDF is not an easy task, as
>>>>>> - text which visually appears as a single item might consist of
>>>>>>   individual parts within the PDF itself, e.g. each character or group
>>>>>>   of characters is placed individually in a different COSString
>>>>>> - text might be drawn using graphics commands
>>>>>> - text can appear within different parts of the PDF (e.g. the text
>>>>>>   might be the content of a form field AND of the annotation
>>>>>>   representing the form field visually)
>>>>>> - you need to look up the encoding information to get from the
>>>>>>   characters in the PDF "string" to the ones you are looking for
>>>>>> ...
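The first point in the list can be sketched in plain Java. Assumptions: Strings
stand in for the COSString pieces of a TJ array, Numbers stand in for the
kerning adjustments between them, and the pieces are already decoded to
readable text; elementsCovering is an illustrative name, not PDFBox API. The
sketch joins the pieces, finds the phrase, and reports which array elements the
match actually touches.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitTextMatch {

    /**
     * Returns the indices of the String elements that overlap the first
     * occurrence of phrase in the concatenated text; empty if not found.
     * Non-String elements (e.g. kerning numbers) are skipped.
     */
    static List<Integer> elementsCovering(List<?> tjArray, String phrase) {
        StringBuilder text = new StringBuilder();
        List<int[]> spans = new ArrayList<>(); // {elementIndex, start, end}
        for (int i = 0; i < tjArray.size(); i++) {
            Object element = tjArray.get(i);
            if (element instanceof String) {
                int start = text.length();
                text.append((String) element);
                spans.add(new int[] {i, start, text.length()});
            }
        }
        List<Integer> hits = new ArrayList<>();
        int from = text.indexOf(phrase);
        if (from < 0 || phrase.isEmpty()) {
            return hits;
        }
        int to = from + phrase.length();
        for (int[] span : spans) {
            // Keep every piece whose character range intersects the match.
            if (span[1] < to && span[2] > from) {
                hits.add(span[0]);
            }
        }
        return hits;
    }
}
```

Knowing which elements are covered makes it possible to blank only those
pieces, whereas clearing the whole array also drops any other text that
happens to share it.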
>>>>>>
>>>>>> If you can post a specific PDF to a public location and describe in
>>>>>> detail which string should have been replaced but wasn't, I will be
>>>>>> able to tell you why that might have happened.
>>>>>>
>>>>>> Maruan
>>>>>>
>>>>>>
>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Currently I am facing a strange problem removing text from some PDFs.
>>>>>>> My program is able to find the text and "remove it" by calling the
>>>>>>> COSString.reset() method.
>>>>>>> The problem is that when I open the output PDF file, I still see the
>>>>>>> text, but it is not selectable (I mean that when I try to highlight it
>>>>>>> with the mouse to copy it, it's not selectable!). When I print the
>>>>>>> content (tokens) of the output file, I DO NOT find the text at all!!
>>>>>>>
>>>>>>> I am currently stuck in the PDF specification 1.5 and really running
>>>>>>> out of time.
>>>>>>>
>>>>>>> I'd so much appreciate any help or any idea on what's going on.
>>>>>>>
>>>>>>> Notes:
>>>>>>> 1. I use PDFBox 1.7.1
>>>>>>> 2. This problem does not occur with all PDFs; only some PDFs cause
>>>>>>> this problem.
>>>>>>>
>>>>>>> Thank you very much.
>>>>>>> a7mad
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]