Re: [Wikitech-l] Open source mobile image recognition in Wikipedia

2014-12-06 Thread Jonas Öberg
Hi Adrien!

 Using the visual word approach I use in Pastec would enable the matching
 of modified images but would also require a lot more resources. Thus, while
 your hash is 256 bits long, an image signature in the Pastec index is
 approximately 8 KB.

8 KB still isn't too bad. It sounds like it could be useful.

 Similarly, I guess that the search complexity of your hash approach is O(1)
 while in Pastec this is much more complicated: first tf-idf ranking and
 then two geometrical rerankings...
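
For readers unfamiliar with the first stage mentioned above, here is a minimal sketch of tf-idf ranking over visual words. It is not Pastec's code: the function names, the toy word ids, and the unnormalised dot-product score are all assumptions made for illustration, and the geometrical reranking steps are left out entirely.

```python
import math
from collections import Counter, defaultdict


def build_index(images):
    """Build an inverted index: visual word -> {image id: term frequency}.

    `images` maps an image id to its list of visual word ids (toy data here).
    """
    index = defaultdict(dict)
    for image_id, words in images.items():
        for word, count in Counter(words).items():
            index[word][image_id] = count / len(words)
    return index


def rank(index, n_images, query_words):
    """Return (image id, score) pairs, best first, using a tf-idf dot product."""
    scores = defaultdict(float)
    for word, q_count in Counter(query_words).items():
        postings = index.get(word)
        if not postings:
            continue  # word never seen in the indexed images
        idf = math.log(n_images / len(postings))
        q_tf = q_count / len(query_words)
        for image_id, tf in postings.items():
            scores[image_id] += (q_tf * idf) * (tf * idf)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical visual word ids for three indexed images.
images = {
    "a": [1, 2, 3, 4],
    "b": [1, 2, 5, 6],
    "c": [7, 8, 9, 1],
}
index = build_index(images)
ranking = rank(index, len(images), [2, 3, 4, 4])
print(ranking)
```

Only images sharing at least one visual word with the query are touched, which is why an inverted index keeps this stage cheap compared with scanning every signature.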

Close to O(1) at least. How does Pastec scale to many images? You
mentioned having about 400,000 currently, which is still a rather fair
number, but what about the full ~22M of Wikimedia Commons? I'm
assuming that since tf-idf is a well known method for text mining,
there are well understood and optimised algorithms to search. Perhaps
something like Elasticsearch would be useful right away too?
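
As a rough back-of-the-envelope check of the sizes discussed above, the following computes the raw signature storage for the full ~22M images. It assumes "approximately 8 KB" means 8 KiB per image and ignores any index overhead, so the totals are only indicative.

```python
N_IMAGES = 22_000_000        # approximate size of Commons, from this thread
PASTEC_SIG_BYTES = 8 * 1024  # "approximately 8 KB", taken as 8 KiB (assumption)
BLOCKHASH_BYTES = 256 // 8   # a 256-bit hash is 32 bytes

pastec_total = N_IMAGES * PASTEC_SIG_BYTES
blockhash_total = N_IMAGES * BLOCKHASH_BYTES

print(f"Pastec signatures: {pastec_total / 1e9:.1f} GB")   # 180.2 GB
print(f"Blockhashes:       {blockhash_total / 1e6:.1f} MB")  # 704.0 MB
```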

That would be an advantage, since with our blockhash we've had to
implement the relevant search algorithms ourselves, for lack of
existing implementations.
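
To illustrate what searching on such hashes involves, here is a minimal sketch of a nearest-match lookup by Hamming distance. It is a naive linear scan, not the search algorithm referred to above; the function names, the example hashes, and the threshold of 10 bits are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes held as integers."""
    return bin(a ^ b).count("1")


def search(hashes, query, max_distance=10):
    """Return (distance, image id) pairs within max_distance, closest first.

    `hashes` maps an image id to its 256-bit hash as an int. The threshold
    is an arbitrary example value, not a recommended setting.
    """
    hits = [(hamming(h, query), image_id) for image_id, h in hashes.items()]
    return sorted(hit for hit in hits if hit[0] <= max_distance)


# Hypothetical 256-bit hashes written as 64 hex digits.
hashes = {
    "original": int("f" * 64, 16),
    "near_copy": int("f" * 63 + "e", 16),  # differs from "original" in 1 bit
    "unrelated": 0,                        # differs in all 256 bits
}
results = search(hashes, int("f" * 64, 16))
print(results)  # [(0, 'original'), (1, 'near_copy')]
```

An exact-match lookup on the hash is a plain dictionary access, which is where the near-constant search time comes from; tolerating a few differing bits is what requires a dedicated search structure at scale.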

One use that we see, and which was discussed recently on the
commons-l mailing list, is applying approaches like yours and ours
to identify duplicate images in Commons. We've generated a list of
21274 duplicate pairs, but some of them aren't actually duplicates,
just very similar. Most commonly this is map data, like [1] and [2],
where just a specific region differs.

I'm hypothesizing that your ORB detection would have better success
there, since it would hopefully detect the colored area as a feature
and be able to distinguish the two from each other.

In general, my feeling is that your work with ORB and our work with
Blockhashes complement each other nicely. They work with different use
cases, but have the same purpose, so being able to search using both
would sometimes be an advantage. What is your strategy for scaling
beyond your existing 400,000 images and is there some way we can
cooperate on this? As we go about hashing additional sets (Flickr is a
prime candidate), it would be interesting for us if we could generate
both our blockhash and your ORB visual words signature in an easy way,
since we retrieve the images anyway.

[1] https://commons.wikimedia.org/wiki/File:Locator_map_Puerto_Rico_Trujillo_Alto.png
[2] https://commons.wikimedia.org/wiki/File:Locator_map_Puerto_Rico_Carolina.png

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Structured data: use of schema.org's structured data for Wikisource works (or similar)

2014-12-06 Thread Wiki Billinghurst
Dear Wikitech (cc'd Wikisource)

A recent discussion in English Wikisource's Scriptorium was querying
why commercial book companies, etc. were getting higher search hits,
especially where they may just have summary information, rather than
full text. In that discussion someone pointed to some of the webmaster
information at Google, e.g. [1], which (ultimately) talks about their
microformat (preferred) or JSON-LD as a means to put in more
particular metadata as explained at schema.org (for creative works
[2])
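
For illustration, here is a minimal sketch of what JSON-LD metadata for a Wikisource work might look like. The title, author, and URL are placeholders, the helper name is hypothetical, and the choice of schema.org's Book type (a subtype of CreativeWork) is an assumption; nothing here reflects what MediaWiki currently emits.

```python
import json


def book_jsonld(name, author, url, language="en"):
    """Build a schema.org Book description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": name,
        "author": {"@type": "Person", "name": author},
        "inLanguage": language,
        "url": url,
    }


# Placeholder values; a real page would fill these from the work's metadata.
doc = book_jsonld(
    "Example Work",
    "Example Author",
    "https://en.wikisource.org/wiki/Example_Work",
)

# JSON-LD is normally embedded in the page inside a script element.
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(doc, indent=2)
    + "\n</script>"
)
print(snippet)
```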

I went to play, and ultimately failed, and was pointed to the
inability to script for security reasons, and the inability to add
microdata (quoting bawolff: "microdata attributes are implemented in
MediaWiki, but currently disabled via $wgAllowMicrodataAttributes";
thanks).

So my naive questions to those that know these things are
1) How do we look to improve external search engine hits for the
sister sites where they are particularly pertinent to a search
[Wikipedia already gets Google special treatment]
2) if the schema.org metadata is a preferred means to progress, what
is the recommended means to progress such an issue
3) presumably some of this fits into the Structured Data discussion;
what means is there to bring this into that discussion?

Thanks for the guidance.

Regards, Billinghurst


[1] https://support.google.com/webmasters/answer/3227642?hl=en&ref_topic=370
[2] http://www.schema.org/CreativeWork
