Re: [Commons-l] Duplicate removal?

James Heald Thu, 04 Dec 2014 02:06:07 -0800

Just out of interest, would your algorithm identify eg


https://commons.wikimedia.org/wiki/File:Portrait_du_Bienheureux_Pierre_de_Luxembourg_-_Mus%C3%A9e_du_Petit_Palais_d%27Avignon.jpg

and

https://commons.wikimedia.org/wiki/File:Master_Of_The_Avignon_School_-_Vision_of_Peter_of_Luxembourg_-_WGA14511.jpg

as duplicates? (Just as a pair of images I happen to have run across this morning).

They're very similar, though the smaller image is in fact sharper, a little darker, and slightly differently framed.


So I'd be interested whether they would ping the algorithm or not.

All best,

  James.


On 04/12/2014 09:43, Jonas Öberg wrote:

Hi everyone,

Careful here - algorithms that spot almost-duplicates will happily
flag different shots from the same shoot. Definitely not something to
act upon without close human inspection.


I agree, and I wouldn't want to flag anything automatically based on
our findings.

The algorithm we use is meant to capture verbatim re-use, not
derivative works. This means that it does a very poor job at matching
images that are different photographic reproductions of the same work
(light conditions, angles, borders, etc, will all differ). It does a
fairly good job at matching images that are verbatim copies, allowing
for resizing and format changes, but it's not perfect, and we
definitely end up with the same hash for some images, even if they're
not identical. This happens often with maps, for instance: take two
maps of US states, one marking Washington in red and one marking
California in red. With no other differences, they'll end up hashed
very close to each other.
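
To illustrate the map case, here is a minimal sketch using a plain
average hash. That choice is an assumption made for illustration only
and is not necessarily the algorithm described in this thread; the
helper names (average_hash, hamming, toy_map) and the 8x8 toy grids
are hypothetical.

```python
def average_hash(pixels):
    """Return one bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming(a, b):
    """Count the bit positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def toy_map(highlight):
    """Build an 8x8 grayscale 'map': dark border, light interior, one marked cell."""
    grid = [[200] * 8 for _ in range(8)]
    for i in range(8):
        grid[0][i] = grid[7][i] = 50   # top and bottom border
        grid[i][0] = grid[i][7] = 50   # left and right border
    row, col = highlight
    grid[row][col] = 90                # the single "state" shaded differently
    return grid


# Two maps that differ only in which cell is highlighted.
map_a = toy_map((2, 2))
map_b = toy_map((5, 5))

hash_a = average_hash(map_a)
hash_b = average_hash(map_b)
distance = hamming(hash_a, hash_b)
print(f"{distance} of {len(hash_a)} bits differ")
```

In this toy setup only the two highlighted cells change the hash, so
the maps land a couple of bits apart out of 64. A real perceptual hash
would typically downscale the image first, which could make a small
highlighted region disappear entirely and yield identical hashes.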

Sincerely,
Jonas

_______________________________________________
Commons-l mailing list
Commons-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/commons-l


