[Wikitech-l] apihighlimits and bot flags

2014-09-19 Thread Jonas Öberg
Dear all,

We're building a Firefox addon to perceptually match images in Commons
against images found elsewhere, so that people can see that they come
from Commons even if they appear on other web sites.
https://moqups.com/jonaso/lopej41Z has a quick mockup.

On https://commons.wikimedia.org/wiki/Commons:Bots/Requests/CommonsHasher
we've requested the apihighlimits right (after discussion on commons-l
starting here: 
https://lists.wikimedia.org/pipermail/commons-l/2014-September/007325.html)
in order to be able to retrieve more than 50 records at once from the
API.

According to EugeneZelenko, who tried to grant this right, it could
not be granted through the normal interface. The question, then: is
apihighlimits included in the bot flag, or how else can the
apihighlimits right be granted?
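For background, if I understand the limits right: a normal account may pass at most 50 titles per query request, and apihighlimits raises that to 500. A rough sketch (not our addon's actual code) of what the batching looks like:

```python
# Illustrative only: batch page titles to respect the MediaWiki API's
# per-request titles limit (assumed 50 normally, 500 with apihighlimits).
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"

def batched_queries(titles, high_limits=False):
    """Yield one API request URL per batch of titles."""
    batch_size = 500 if high_limits else 50
    for i in range(0, len(titles), batch_size):
        batch = titles[i:i + batch_size]
        params = {
            "action": "query",
            "prop": "imageinfo",
            "iiprop": "url|sha1",
            "titles": "|".join(batch),
            "format": "json",
        }
        yield API + "?" + urlencode(params)

# With 120 titles, a normal account needs 3 requests; with
# apihighlimits, a single request suffices.
urls = list(batched_queries(["File:Example%d.jpg" % i for i in range(120)]))
```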


Sincerely,

-- 
Jonas Öberg, Founder & Shuttleworth Foundation Fellow
Commons Machinery | jo...@commonsmachinery.se
E-mail is the fastest way to my attention

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] apihighlimits and bot flags

2014-09-19 Thread Jonas Öberg
Thanks Bartosz and Petr, much appreciated, this clears up the question nicely :)

Sincerely,
Jonas

On 19 September 2014 10:24, Bartosz Dziewoński  wrote:
> Yes, the 'apihighlimits' *permission* is included in the 'bot' *group* (and
> the 'sysop' group, too). You can see available groups and the permissions
> they are assigned on
> https://commons.wikimedia.org/wiki/Special:ListGroupRights
>
> --
> Bartosz Dziewoński

Re: [Wikitech-l] Open source mobile image recognition in Wikipedia

2014-11-24 Thread Jonas Öberg
Hi Adrien,

this looks very interesting - I'm happy to see your work and I briefly
looked into your sources and API. With your 440 000 images, do you
have any clear idea about the accuracy of ORB? To explain: I'm working
on Elog.io, which provides a *similar* service and API[1] to yours,
but uses a rather different algorithm and store, and targets a
different use case. Our algorithm is a variant of a Blockhash[2] algorithm, which
does not do any feature detection at all, but which can easily run in
a browser or mobile platform (we have versions for JavaScript, C and
Python) to generate 256 bit hashes of images. With a hamming distance
calculation, we then determine the quality of a match.
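To make the matching step concrete, here's a minimal sketch (the threshold of 10 bits is illustrative, not our tuned cutoff):

```python
# Compare two 256-bit Blockhash values (64-char hex strings) by
# Hamming distance: the number of differing bits after XOR.
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hashes."""
    a = int(hash_a, 16)
    b = int(hash_b, 16)
    return bin(a ^ b).count("1")

def is_match(hash_a: str, hash_b: str, threshold: int = 10) -> bool:
    """Treat hashes within `threshold` bits as the same image (illustrative cutoff)."""
    return hamming_distance(hash_a, hash_b) <= threshold
```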

We work primarily on a use case of verbatim use, with a user getting
images from Wikimedia and re-using them elsewhere. Algorithms without
feature detection give very poor results for any modification to an
image, like rotation or cropping. But since that's not within our use
case, it works; the flip side, of course, is that you can't expect to
photograph something (a newspaper article with an image, for instance)
and then match it against a set of images, as a feature-based approach
like yours would allow.

The other difference is that our database store isn't specifically
tailored to our hashes: we use W3C Media Annotations to store any kind
of metadata about images, and could equally well store your ORB
signatures assuming they can be serialised.

To give you some numbers, for our use cases (verbatim use, potentially
with a format change such as JPG to PNG, and scaling down to 100px
width) we can successfully match ca. 87% of cases, and we have a
collision rate (different images resulting in the same or nearly the
same hash) of ca. 1.2%. Both numbers are measured against the
Wikimedia Commons set.

While we currently have the full ~22M images from Wikimedia Commons in
our database, we're still ironing out the kinks of the system and
making some additional improvements. If you think that we should
consider ORB instead of or in addition to our current algorithms, we'd
love to give that a try, and it'd obviously be very interesting if we
could end up having compatible signatures compared to your database.

Sincerely,
Jonas

[1] http://docs.cmcatalog.apiary.io
[2] http://blockhash.io

On 24 November 2014 at 11:25, Adrien Maglo  wrote:
> Hello,
>
>
> I am not sure this is the right mailing list to introduce this project, but I
> have just released Displee. It is a small Android app that lets you search
> for images in the English Wikipedia by taking pictures:
> https://play.google.com/store/apps/details?id=org.visualink.displee
> It is a kind of open-source Google Goggles for images from en.wikipedia.org.
>
> I have developed Displee as a demonstrator of Pastec http://pastec.io, my
> open source image recognition index and search engine for mobile apps.
> The index hosted on my server in France currently contains about 440 000
> images. They may not be the most relevant ones, but this is a start. ;-)
> I also have other ideas to improve this tiny app if it is of interest to
> the community.
>
> Displee source code (MIT) is available here:
> https://github.com/Visu4link/displee
> Pastec source code (LGPL) is available here:
> https://github.com/Visu4link/pastec
> The source code of the Displee back-end is not released yet. It is basically
> a Python 3 Django application.
>
> I will be glad to receive your feedback and answer any question!
>
> Best regards,
>
>
> --
> Adrien Maglo
> Pastec developer
> http://www.pastec.io
> +33 6 27 94 34 41

Re: [Wikitech-l] Open source mobile image recognition in Wikipedia

2014-12-06 Thread Jonas Öberg
Hi Adrien!

> Using the "visual word" approach I use in Pastec would enable the matching
> of modified images but would also require a lot more resources. Thus, while
> your hash is 256 bits long, an image signature in the Pastec index is
> approximately 8 KB.

8 KB still isn't too bad. It sounds like it could be useful.

> Similarly, I guess that the search complexity of your hash approach is O(1)
> while in Pastec this is much more complicated: first "tf-idf" ranking and
> then two geometrical rerankings...

Close to O(1), at least. How does Pastec scale to many images? You
mentioned having about 400,000 currently, which is still a rather fair
number, but what about the full ~22M of Wikimedia Commons? I'm
assuming that since tf-idf is a well-known method in text mining,
there are well-understood and optimised search algorithms available
for it. Perhaps something like Elasticsearch would be useful right
away too?

That would be an advantage, since with our blockhash we've had to
implement the relevant search algorithms ourselves, for lack of
existing implementations.
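For what it's worth, one standard way to avoid a linear scan when searching by Hamming distance is multi-index hashing. This is only an illustrative sketch of that kind of algorithm, not our production code: split each 256-bit hash into 16 bands of 16 bits, index each band exactly, and by the pigeonhole principle any hash within distance 15 must agree verbatim on at least one band.

```python
from collections import defaultdict

BANDS = 16       # split each 256-bit hash into 16 bands
BAND_BITS = 16   # of 16 bits each

def bands(h: int):
    """Split an integer hash into (position, value) band slices."""
    mask = (1 << BAND_BITS) - 1
    return [(i, (h >> (i * BAND_BITS)) & mask) for i in range(BANDS)]

class HammingIndex:
    """Candidate lookup via exact band matches, verified by full distance.

    Pigeonhole: two 256-bit hashes within Hamming distance 15 must
    agree exactly on at least one of the 16 bands.
    """
    def __init__(self):
        self.tables = defaultdict(set)   # (band position, value) -> keys
        self.hashes = {}

    def add(self, key, h: int):
        self.hashes[key] = h
        for band in bands(h):
            self.tables[band].add(key)

    def query(self, h: int, max_dist: int = 10):
        candidates = set()
        for band in bands(h):
            candidates |= self.tables.get(band, set())
        # Verify each candidate with the exact Hamming distance.
        return [k for k in candidates
                if bin(self.hashes[k] ^ h).count("1") <= max_dist]
```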

One problem that we see, and which was discussed recently on the
commons-l mailing list, is the possibility of using approaches like
yours and ours to identify duplicate images in Commons. We've
generated a list of 21,274 duplicate pairs, but some of them aren't
actually duplicates, just very similar. Most commonly this is map
data, like [1] and [2], where just a specific region differs.

I'm hypothesizing that your ORB detection would have better success
there, since it would hopefully detect the colored area as a feature
and be able to distinguish the two from each other.

In general, my feeling is that your work with ORB and our work with
Blockhash complement each other nicely. They address different use
cases but have the same purpose, so being able to search using both
would sometimes be an advantage. What is your strategy for scaling
beyond your existing 400,000 images, and is there some way we can
cooperate on this? As we go about hashing additional sets (Flickr is a
prime candidate), it would be interesting for us to generate both our
blockhash and your ORB visual-word signature in one pass, since we
retrieve the images anyway.

[1] 
https://commons.wikimedia.org/wiki/File:Locator_map_Puerto_Rico_Trujillo_Alto.png
[2] https://commons.wikimedia.org/wiki/File:Locator_map_Puerto_Rico_Carolina.png


Re: [Wikitech-l] [Commons-l] Elog.io now up w/ Commons data

2014-12-11 Thread Jonas Öberg
Hi Cornelius!

For images it can match against the catalog, it should give accurate
information. If it doesn't, use the "report" link to let us know!

You're right, though, that for images it doesn't find in its catalog,
we don't provide any information. That's the equivalent of saying
"this picture may or may not be openly licensed, but right now we have
no information to tell either way".

Sincerely,
Jonas
On 11 Dec 2014 15:57, "Cornelius Kibelka" wrote:

> Wow, what a nice and interesting browser extension. Congrats!
>
> Just a question: as far as I can see, the tool doesn't give the complete
> and correct licensing information, as the source is missing. Or am I
> mistaken?
>
> Best
> Cornelius
>
> 2014-12-10 19:30 GMT+01:00 Jonas Öberg :
>
>> Dear all,
>>
>> thanks for all your help with answering questions and giving feedback
>> over the last couple of months. I'm happy to say that we're finally at
>> a stage where we've hashed 22,452,638 images from Wikimedia Commons
>> and launched Elog.io in public beta: http://elog.io/
>>
>> Elog.io is an open API, as well as browser plugins, that can query and
>> retrieve information about images using a perceptual hash that's easy
>> and quick to calculate in a browser.
>>
>> What the browser extensions allow you to do is match an image you find
>> "in the wild" against Wikimedia Commons. If it can be matched against
>> an image from Commons, it'll show you the title, author, and license,
>> and give you links back to Wikimedia, the license, and a quick and
>> handy "Copy as HTML" to copy the image and attribution as an HTML
>> snippet for pasting into Word, LibreOffice, Wordpress, etc.
>>
>> Our API provides lookup functions to find information using a URL (the
>> Commons' page name URL) or using the perceptual hash. You get
>> information back as JSON in W3C Media Annotations format. Of course,
>> the information you get back is no better than what the Commons API
>> provides, so if you already have a page name URL, you may as well
>> query it directly, and rely on our API only for searching by
>> perceptual hashes.
>>
>> The algorithm we use for calculating perceptual hashes, which you'll
>> need to query our API, is at http://blockhash.io/
>>
>>
>> Sincerely,
>> Jonas
>>
>> ___
>> Commons-l mailing list
>> common...@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/commons-l
>>
>
>
>
> --
> Cornelius Kibelka
>
> International Affairs
> Werkstudent | student trainee
>
> Wikimedia Deutschland e.V.
> Tempelhofer Ufer 23-24
> 10963 Berlin
>
> Tel.: +49 30 219158260
> http://wikimedia.de
>
> Imagine a world in which every single human being can freely share in
> the sum of all knowledge. Help us make that happen!
> http://spenden.wikimedia.de/
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
> Registered in the register of associations of the Amtsgericht
> Berlin-Charlottenburg under number 23855 B. Recognised as a charitable
> organisation by the Finanzamt für Körperschaften I Berlin, tax number
> 27/681/51985.
>