[Wikitech-l] apihighlimits and bot flags
Dear all,

We're building a Firefox addon to perceptually match images in Commons against images found elsewhere, so that people can see that they come from Commons even if they appear on other web sites. https://moqups.com/jonaso/lopej41Z has a quick mockup.

On https://commons.wikimedia.org/wiki/Commons:Bots/Requests/CommonsHasher we've requested the apihighlimits right (after a discussion on commons-l starting here: https://lists.wikimedia.org/pipermail/commons-l/2014-September/007325.html) in order to be able to retrieve more than 50 records at once from the API. According to EugeneZelenko, who tried to grant this right, it could not be granted through the normal interface.

Question then: is apihighlimits included in the bot flag, or how else can the apihighlimits right be granted?

Sincerely,

--
Jonas Öberg, Founder & Shuttleworth Foundation Fellow
Commons Machinery | jo...@commonsmachinery.se
E-mail is the fastest way to my attention
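(For context, a minimal sketch in Python of the kind of batched imageinfo query this is about. The batch size shown reflects the usual MediaWiki default of 50 titles per request, which apihighlimits raises to 500; exact limits depend on the wiki's configuration.)

    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    # Without apihighlimits, at most 50 titles may be passed per request;
    # accounts with the right (e.g. via the bot group) may pass up to 500.
    BATCH_SIZE = 50

    def fetch_imageinfo(titles):
        """Fetch URL and SHA-1 metadata for a batch of Commons file titles."""
        params = {
            "action": "query",
            "prop": "imageinfo",
            "iiprop": "url|sha1",
            "titles": "|".join(titles[:BATCH_SIZE]),
            "format": "json",
        }
        return requests.get(API, params=params).json()

    print(fetch_imageinfo(["File:Example.jpg", "File:Example.png"]))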
Re: [Wikitech-l] apihighlimits and bot flags
Thanks Bartosz and Petr, much appreciated, this clears up the question nicely :)

Sincerely,
Jonas

On 19 September 2014 10:24, Bartosz Dziewoński wrote:
> Yes, the 'apihighlimits' *permission* is included in the 'bot' *group* (and
> the 'sysop' group, too). You can see the available groups and the permissions
> they are assigned on
> https://commons.wikimedia.org/wiki/Special:ListGroupRights
>
> --
> Bartosz Dziewoński

--
Jonas Öberg, Founder & Shuttleworth Foundation Fellow
Commons Machinery | jo...@commonsmachinery.se
E-mail is the fastest way to my attention
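(As a side note: whether the flag has taken effect can be checked with a standard userinfo query - a small Python sketch, which must be run inside an authenticated session so it reports the bot account's rights rather than those of an anonymous user.)

    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    # List the groups and rights of the current user. Run this through an
    # authenticated session (e.g. after a bot login); done anonymously it
    # will report the rights of the anonymous user instead.
    session = requests.Session()
    params = {
        "action": "query",
        "meta": "userinfo",
        "uiprop": "groups|rights",
        "format": "json",
    }
    info = session.get(API, params=params).json()["query"]["userinfo"]
    print("apihighlimits" in info.get("rights", []))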
Re: [Wikitech-l] Open source mobile image recognition in Wikipedia
Hi Adrien,

this looks very interesting - I'm happy to see your work, and I briefly looked into your sources and API. With your 440 000 images, do you have any clear idea about the accuracy of ORB?

To explain: I'm working on Elog.io, which provides a *similar* service and API[1] to yours, but uses a rather different algorithm and store, and targets a different use case. Our algorithm is a variant of the Blockhash[2] algorithm, which does not do any feature detection at all, but which can easily run in a browser or on a mobile platform (we have versions for JavaScript, C and Python) to generate 256-bit hashes of images. With a Hamming distance calculation, we then determine the quality of a match (a small sketch of this comparison follows after this message).

We work primarily on a use case of verbatim use, with a user getting images from Wikimedia and re-using them elsewhere. Algorithms without feature detection give very bad results for any modifications to an image, like rotation, cropping, etc. But since that's not within our use case, it works; the flip side, of course, is that you can't expect to photograph something (a newspaper article with an image, for instance) and then match it against a set of images, as one might otherwise expect to be able to do.

The other difference is that our database store isn't specifically tailored to our hashes: we use W3C Media Annotations to store any kind of metadata about images, and could equally well store your ORB signatures, assuming they can be serialised.

To give you some numbers, for our use cases (verbatim use, potentially with a format change jpg->png etc, and scaling down to 100px width) we can successfully match ca 87% of cases, and we have a collision rate (different images resulting in the same or near-same hashes) of ca 1.2%. Both numbers are against the Wikimedia Commons set.

While we currently have the full ~22M images from Wikimedia Commons in our database, we're still ironing out the kinks of the system and making some additional improvements. If you think that we should consider ORB instead of, or in addition to, our current algorithms, we'd love to give that a try, and it would obviously be very interesting if we could end up with signatures compatible with your database.

Sincerely,
Jonas

[1] http://docs.cmcatalog.apiary.io
[2] http://blockhash.io

On 24 November 2014 at 11:25, Adrien Maglo wrote:
> Hello,
>
> I am not sure this is the right mailing list to introduce this project, but I
> have just released Displee. It is a small Android app that lets you search
> for images in the English Wikipedia by taking pictures:
> https://play.google.com/store/apps/details?id=org.visualink.displee
> It is a kind of open source Google Goggles for images from en.wikipedia.org.
>
> I have developed Displee as a demonstrator of Pastec http://pastec.io, my
> open source image recognition index and search engine for mobile apps.
> The index hosted on my server in France currently contains about 440 000
> images. They may not be the most relevant ones, but this is a start. ;-)
> I also have other ideas to improve this tiny app if it is of interest to
> the community.
>
> Displee source code (MIT) is available here:
> https://github.com/Visu4link/displee
> Pastec source code (LGPL) is available here:
> https://github.com/Visu4link/pastec
> The source code of the Displee back-end is not released yet. It is basically
> a Python 3 Django application.
>
> I will be glad to receive your feedback and answer any questions!
> Best regards,
>
> --
> Adrien Maglo
> Pastec developer
> http://www.pastec.io
> +33 6 27 94 34 41

--
Jonas Öberg, Founder & Shuttleworth Foundation Fellow
Commons Machinery | jo...@commonsmachinery.se
E-mail is the fastest way to my attention
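(A minimal sketch in Python of the Hamming-distance comparison described in the message above, assuming 256-bit hashes serialised as 64-character hex strings; the threshold value is purely illustrative, not the one Elog.io actually uses.)

    def hamming_distance(hash_a: str, hash_b: str) -> int:
        """Number of differing bits between two hex-encoded 256-bit hashes."""
        return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

    def is_match(hash_a: str, hash_b: str, threshold: int = 10) -> bool:
        """Treat two images as a match if their hashes differ in at most
        `threshold` bits (threshold chosen here for illustration only)."""
        return hamming_distance(hash_a, hash_b) <= threshold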
Re: [Wikitech-l] Open source mobile image recognition in Wikipedia
Hi Adrien!

> Using the "visual word" approach I use in Pastec would enable the matching
> of modified images but would also require a lot more resources. Thus, while
> your hash is 256 bits long, an image signature in the Pastec index is
> approximately 8 KB.

8 KB still isn't too bad. It sounds like it could be useful.

> Similarly, I guess that the search complexity of your hash approach is O(1)
> while in Pastec this is much more complicated: first "tf-idf" ranking and
> then two geometrical rerankings...

Close to O(1), at least. How does Pastec scale to many images? You mentioned having about 440,000 currently, which is still a rather fair number, but what about the full ~22M of Wikimedia Commons? I'm assuming that since tf-idf is a well-known method for text mining, there are well-understood and optimised algorithms for searching it. Perhaps something like Elasticsearch would be useful right away, too? That would be an advantage, since with our blockhash we've had to implement the relevant search algorithms ourselves, lacking existing implementations.

One problem that we see, and which was discussed recently on the commons-l mailing list, is the possibility of using approaches like yours and ours to identify duplicate images in Commons. We've generated a list of 21,274 duplicate pairs, but some of them aren't actually duplicates, just very similar. Most commonly this is map data, like [1] and [2], where only a specific region differs. I'm hypothesizing that your ORB detection would have better success there, since it would hopefully detect the colored area as a feature and be able to distinguish the two from each other.

In general, my feeling is that your work with ORB and our work with Blockhash complement each other nicely. They address different use cases but have the same purpose, so being able to search using both would sometimes be an advantage. What is your strategy for scaling beyond your existing 440,000 images, and is there some way we can cooperate on this? As we go about hashing additional sets (Flickr is a prime candidate), it would be interesting for us if we could generate both our blockhash and your ORB visual-word signature in an easy way, since we retrieve the images anyway.

[1] https://commons.wikimedia.org/wiki/File:Locator_map_Puerto_Rico_Trujillo_Alto.png
[2] https://commons.wikimedia.org/wiki/File:Locator_map_Puerto_Rico_Carolina.png
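(A brute-force sketch in Python of the duplicate-pair scan described above; a real index would need something smarter than this O(n^2) loop, which is only practical as a one-off batch job, and the threshold is again illustrative.)

    from itertools import combinations

    def hamming(a: int, b: int) -> int:
        """Bit difference between two integer-encoded 256-bit hashes."""
        return bin(a ^ b).count("1")

    def duplicate_pairs(hashes, threshold=10):
        """Yield title pairs whose hashes differ by at most `threshold` bits.

        `hashes` maps a file title to its hash as an int. Quadratic in the
        number of images, so suitable only for offline scans."""
        for (t1, h1), (t2, h2) in combinations(hashes.items(), 2):
            if hamming(h1, h2) <= threshold:
                yield t1, t2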
Re: [Wikitech-l] [Commons-l] Elog.io now up w/ Commons data
Hi Cornelius!

For images which it matches against the catalog, it should give accurate information. If it doesn't, use the "report" link to let us know! You're right, though, that for images it doesn't find in its catalog, we don't provide any information. That's the equivalent of saying "this picture may or may not be openly licensed, but right now we have no information to tell either way".

Sincerely,
Jonas

On 11 Dec 2014 15:57, "Cornelius Kibelka" wrote:
> Wow, what a nice and interesting browser extension. Congrats!
>
> Just a question: as far as I can see, the tool doesn't give the complete
> and correct licensing information, as the source is missing. Or am I
> mistaken?
>
> Best
> Cornelius
>
> 2014-12-10 19:30 GMT+01:00 Jonas Öberg :
>
>> Dear all,
>>
>> thanks for all your help with answering questions and giving feedback
>> over the last couple of months. I'm happy to say that we're finally at
>> a stage where we've hashed 22,452,638 images from Wikimedia Commons
>> and launched Elog.io in public beta: http://elog.io/
>>
>> Elog.io is an open API as well as a set of browser plugins that can
>> query and get information about images using a perceptual hash that's
>> easy and quick to calculate in a browser.
>>
>> What the browser extensions allow you to do is match an image you find
>> "in the wild" against Wikimedia Commons. If it can be matched against
>> an image from Commons, it'll show you the title, author, and license,
>> and give you links back to Wikimedia, the license, and a quick and
>> handy "Copy as HTML" to copy the image and attribution as an HTML
>> snippet for pasting into Word, LibreOffice, Wordpress, etc.
>>
>> Our API provides lookup functions to find information using a URL (the
>> Commons page name URL) or using the perceptual hash. You get the
>> information back as JSON in W3C Media Annotations format. Of course,
>> the information you get back is no better than that provided by the
>> Commons API, so if you already have a page name URL, you may as well
>> query it directly, and rely on our API only for searching by
>> perceptual hashes.
>>
>> The algorithm we use for calculating perceptual hashes, which you'll
>> need to query our API, is at http://blockhash.io/
>>
>> Sincerely,
>> Jonas
>
> --
> Cornelius Kibelka
>
> International Affairs
> Werkstudent | student trainee
>
> Wikimedia Deutschland e.V.
> Tempelhofer Ufer 23-24
> 10963 Berlin
>
> Tel.: +49 30 219158260
> http://wikimedia.de
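(An illustrative sketch in Python of a hash lookup against an API of this kind. The base URL, endpoint path, and parameter name below are hypothetical placeholders, not the actual Elog.io interface; the real API is documented at http://docs.cmcatalog.apiary.io.)

    import requests

    # NOTE: BASE_URL, the /lookup/hash path and the "hash" parameter are
    # made up for illustration only; consult http://docs.cmcatalog.apiary.io
    # for the actual endpoints and parameters.
    BASE_URL = "https://catalog.example.org"

    def lookup_by_hash(blockhash_hex):
        """Look up image metadata (W3C Media Annotations as JSON) for a
        256-bit perceptual hash encoded as a 64-character hex string."""
        resp = requests.get(f"{BASE_URL}/lookup/hash",
                            params={"hash": blockhash_hex})
        resp.raise_for_status()
        return resp.json()

    matches = lookup_by_hash("31a0" * 16)  # dummy 64-character hex hash
    for match in matches:
        print(match)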