Re: [liberationtech] Metadata Cleanup trough File Format Convertion?

2013-07-17 Thread jvoisin
On 17/07/2013 21:22, Nick wrote:
> Quoth Fabio Pietrosanti (naif):
>> If a JPEG is converted to PNG, "maybe" all metadatas are lost. (this
>> has to be verified)
>> If a DOC/DOCX is converted to a PDF, maybe all metadatas are lost.
> 
> Interesting topic. I'd be most worried about watermarks, as 
> depending on the format they may well remain, and be difficult to 
> find or test for. I don't know if they're routinely used, but it's 
> certainly something to be aware of.
> --

Do you know about MAT, the Metadata Anonymisation Toolkit
(https://mat.boum.org)?
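
On the "this has to be verified" part quoted above: for the PNG side, a
quick check needs nothing beyond the Python standard library. The sketch
below walks the chunk list of a PNG file and reports the standard
ancillary chunks that can carry metadata. The function name
png_metadata_chunks and the exact set of chunk types are choices made
for this illustration. It says nothing about watermarks hidden in the
pixel data itself, which is the harder problem Nick raises.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Standard ancillary chunks that can hold textual, EXIF or timestamp
# metadata. Private or unknown chunk types are not covered here.
METADATA_CHUNKS = {b"tEXt", b"zTXt", b"iTXt", b"eXIf", b"tIME"}


def png_metadata_chunks(data):
    """Return the names of metadata-bearing chunks found in PNG bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    found = []
    offset = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    while offset + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[offset:offset + 8])
        if chunk_type in METADATA_CHUNKS:
            found.append(chunk_type.decode("ascii"))
        if chunk_type == b"IEND":
            break
        offset += 8 + length + 4
    return found
```

Usage would be along the lines of
png_metadata_chunks(open("converted.png", "rb").read()); an empty list
means none of the chunk types listed above are present.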



--
Too many emails? Unsubscribe, change to digest, or change password by emailing 
moderator at compa...@stanford.edu or changing your settings at 
https://mailman.stanford.edu/mailman/listinfo/liberationtech

Re: [liberationtech] [open-science] Removing watermarks from pdfs (pdfparanoia)

2013-02-10 Thread jvoisin
Hello,
I am the developer behind the previously cited MAT
(https://mat.boum.org). I just want to add my 2 cents, based on what I
learned while developing metadata-anonymisation processes.

Visible metadata, such as lines of text or pictures, can be spotted by
eye and removed with the help of some pdfminer-fu, so I will only talk
about hidden metadata/watermarks here.

PDF is a pretty complex format to process, so I render it onto a
cairo[1] surface and then save this surface to a PDF file. Because this
produces a completely new PDF, it strips a large part of (if not all)
the hidden watermarks/metadata, without transforming the text into
pictures. The whole process is implemented in MAT [2].
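
To give an idea of the shape of this, here is a minimal sketch. It is
not the actual MAT code (see [2] for that). It assumes the Poppler
GObject bindings (via PyGObject) to load and draw the pages and pycairo
for the output surface; the text above only names cairo, so Poppler and
the function name rerender_pdf are assumptions made for illustration.

```python
import os
from pathlib import Path


def rerender_pdf(src_path, dst_path):
    """Draw every page of src_path onto a fresh cairo PDF surface."""
    # Imported here so the module still loads on machines where the
    # bindings are missing; both are third-party dependencies.
    import cairo
    import gi
    gi.require_version("Poppler", "0.18")
    from gi.repository import Poppler

    # Poppler expects a file:// URI rather than a plain path.
    uri = Path(src_path).resolve().as_uri()
    document = Poppler.Document.new_from_file(uri, None)

    # The initial size is a placeholder; it is reset before each page.
    surface = cairo.PDFSurface(os.fspath(dst_path), 1, 1)
    context = cairo.Context(surface)

    for index in range(document.get_n_pages()):
        page = document.get_page(index)
        width, height = page.get_size()
        surface.set_size(width, height)
        # Vector rendering: text stays text, it is not rasterised.
        page.render_for_printing(context)
        context.show_page()

    surface.finish()
```

Only what Poppler actually draws ends up in the output file, so objects
that exist in the original file but are never painted do not survive.
Anything that is part of the visible rendering is of course kept.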

This approach could be added to pdfparanoia to counter hidden threats.


1. http://www.cairographics.org/
2. https://gitweb.torproject.org/user/jvoisin/mat.git/blob/HEAD:/MAT/office.py#l141
