Just out of interest, would your algorithm identify e.g.

https://commons.wikimedia.org/wiki/File:Portrait_du_Bienheureux_Pierre_de_Luxembourg_-_Mus%C3%A9e_du_Petit_Palais_d%27Avignon.jpg

and

https://commons.wikimedia.org/wiki/File:Master_Of_The_Avignon_School_-_Vision_of_Peter_of_Luxembourg_-_WGA14511.jpg

as duplicates? (Just as a pair of images I happen to have run across this morning).

They're very similar, though the smaller image is in fact sharper, a little darker, and slightly differently framed.

So I'd be interested whether they would ping the algorithm or not.

All best,

  James.


On 04/12/2014 09:43, Jonas Öberg wrote:
Hi everyone,

Careful here - algorithms that spot almost-duplicates will happily
flag different shots from the same shoot. Definitely not something to
act upon without close human inspection.

I agree, and I wouldn't want to flag anything automatically based on
our findings.

The algorithm we use is meant to capture verbatim re-use, not
derivative works. This means that it does a very poor job of matching
images that are different photographic reproductions of the same work
(light conditions, angles, borders, etc. will all differ). It does a
fairly good job of matching images that are verbatim copies, allowing
for resizing and format changes, but it's not perfect, and we
definitely end up with the same hash for some images even when they're
not identical. This happens often with maps, for instance: take
two maps of US states, one marking Washington in red and one marking
California in red. With no other differences, they'll end up hashed
very close to each other.
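
As a rough illustration of the map case, here is a minimal sketch using a
plain average hash. That choice is an assumption made for the example and
is not necessarily the algorithm in use here. The function names
(average_hash, hamming, make_map, half_size) and the synthetic 64x64
"maps" are likewise invented for illustration.

```python
def average_hash(pixels, hash_size=8):
    """Hash a grayscale image given as a list of rows of 0-255 ints.

    The image is reduced to hash_size x hash_size block means, and each
    block becomes one bit: 1 if brighter than the overall mean, else 0.
    Assumes the image is at least hash_size pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    block_means = []
    for by in range(hash_size):
        y0, y1 = by * height // hash_size, (by + 1) * height // hash_size
        for bx in range(hash_size):
            x0, x1 = bx * width // hash_size, (bx + 1) * width // hash_size
            total = sum(pixels[y][x] for y in range(y0, y1) for x in range(x0, x1))
            block_means.append(total / ((y1 - y0) * (x1 - x0)))
    overall = sum(block_means) / len(block_means)
    return [1 if m > overall else 0 for m in block_means]


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def make_map(marked_top, marked_left, size=64, mark=4):
    """Hypothetical 'map': light land on the left, dark sea on the right,
    with one small mark x mark region darkened (the highlighted state)."""
    img = [[200 if x < size // 2 else 60 for x in range(size)] for _ in range(size)]
    for y in range(marked_top, marked_top + mark):
        for x in range(marked_left, marked_left + mark):
            img[y][x] = 0
    return img


def half_size(pixels):
    """Crude downscale: keep every second pixel in each direction."""
    return [row[::2] for row in pixels[::2]]


map_wa = make_map(2, 2)     # one state highlighted
map_ca = make_map(50, 10)   # a different state highlighted
unrelated = [[200 if y < 32 else 60 for _ in range(64)] for y in range(64)]

hash_wa = average_hash(map_wa)
hash_ca = average_hash(map_ca)

print(hamming(hash_wa, hash_ca))                          # 0
print(hamming(hash_wa, average_hash(half_size(map_wa))))  # 0
print(hamming(hash_wa, average_hash(unrelated)))          # 32
```

In this toy setup the two maps differ only in a small highlighted region,
which is too small to flip any bit, so they collide exactly. The resized
copy also matches, while the unrelated image differs in half of its 64
bits.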

Sincerely,
Jonas

_______________________________________________
Commons-l mailing list
Commons-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/commons-l


