[Wikitech-l] Blame maps aka authorship detection

2013-02-25 Thread Luca de Alfaro
Dear All,

Michael Shavlovsky and I have been working on blame maps (authorship
detection) for the various Wikipedias.
We have code in the Wikimedia repository, written with the goal of
obtaining a production system capable of attributing all content (not
just a research demo).  Here are some pointers:

   - Code
   - Description of the blame maps MediaWiki extension
   - Detailed description of the underlying algorithm, with performance
   evaluation
   - Demo

These are also all available from
https://sites.google.com/a/ucsc.edu/luca/the-wikipedia-authorship-project
In brief, for each page we store metadata that summarizes the entire text
evolution of the page; this metadata, compressed, is about three times the
size of a typical revision.  Each time a new revision is made, we read this
metadata, attribute every word of the revision, store updated metadata, and
store authorship data for the revision.  The process takes 1-2 seconds
depending on the average revision size (most of the time is actually
devoted to deserializing and reserializing the metadata).  Comparing with
all previous revisions takes care of things like content that is deleted
and then later re-inserted, and other various attacks that might happen
once authorship is displayed.  I should also add that these algorithms are
independent from the ones in WikiTrust, and should be much better.
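
To make the idea of comparing against all previous revisions concrete, here is a toy sketch. It is NOT the algorithm in our code: it keeps every past revision in full instead of the compact metadata described above, and it uses Python's difflib for matching. The function name `attribute` and the `min_block` threshold are assumptions made only for this illustration.

```python
from difflib import SequenceMatcher


def attribute(history, new_text, editor, min_block=2):
    """Attribute each word of new_text to an author (toy illustration).

    history is a list of past revisions, oldest first; each revision is a
    list of (word, author) pairs.  Words found in a run of at least
    min_block words in some earlier revision keep that revision's author;
    everything else is credited to editor.
    """
    words = new_text.split()
    authors = [None] * len(words)
    # Newest revision first, so the most recent matching context wins.
    for past in reversed(history):
        past_words = [w for w, _ in past]
        matcher = SequenceMatcher(None, past_words, words, autojunk=False)
        for a, b, size in matcher.get_matching_blocks():
            if size < min_block:
                continue  # ignore short matches (and the size-0 sentinel)
            for k in range(size):
                if authors[b + k] is None:
                    authors[b + k] = past[a + k][1]
    attributed = [(w, a if a is not None else editor)
                  for w, a in zip(words, authors)]
    history.append(attributed)
    return attributed


history = []
attribute(history, "wikis are edited by many people", "alice")
attribute(history, "wikis are fun", "bob")
# carol re-inserts text that bob deleted; it stays attributed to alice.
restored = attribute(history, "wikis are fun edited by many people", "carol")
```

Because the third revision is matched against the first one as well as the second, the re-inserted words are not credited to the editor who restored them.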

We have NOT developed a GUI for this: our plan was just to provide a data
API that gives information on authorship of each word.  There are many ways
to display the information, from page summaries of authorship to detailed
word-by-word information, and we thought that surely others would want to
play with the visualization aspect.

I am writing this message as we hope this might be of interest, and as we
would be quite happy to find people willing to collaborate.  Is anybody
interested in developing a GUI for it and talking to us about what API we
should have for retrieving this authorship information?  Is there anybody
interested in helping to move the code to a production-ready stage?

I also would like to mention that Fabian Floeck has developed another very
interesting algorithm for attributing the content, reported in
http://wikipedia-academy.de/2012/w/images/2/24/23_Paper_Fabian_Fl%C3%B6ck_Andriy_Rodchenko.pdf
Fabian and I are now starting to collaborate: we want to compare the
algorithms, and work together to obtain something we are happy with, and
that can run in production.

Indeed, I think a reasonable first goal would be to:

   - Define a data API
   - Define some coarse requirements of the system
   - Have a look at above results / algorithms / implementation and advise
   us.

I am sure that the algorithm details can be fine-tuned and changed to no
end in a collaborative effort, once the first version is up and running.
The problem is putting together a bit of effort to get to that first
running version.

Luca
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Blame maps aka authorship detection

2013-02-25 Thread Luca de Alfaro
I agree: in fact we don't do it in the write pipeline.  The code we wrote
implements a simple queue, where page_ids are queued for processing.  The
processing job then gets a page_id out of that table, and processes all the
missing revisions for that page_id.  So this is useful also if (say) there
is a page merge or something similar: we can just erase all authorship
information for that page, and at the next edit, it will be rebuilt.
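
As a rough sketch of the queue just described (an in-memory stand-in, not the actual extension code): the class name `AuthorshipQueue` and the two callbacks `fetch_revision_ids` and `process_revision` are hypothetical names chosen for this example, and a real deployment would keep the queue in a database table.

```python
from collections import deque


class AuthorshipQueue:
    """Queue of page_ids whose missing revisions still need processing."""

    def __init__(self, fetch_revision_ids, process_revision):
        self.fetch_revision_ids = fetch_revision_ids  # page_id -> list of revision_ids
        self.process_revision = process_revision      # called as (page_id, revision_id)
        self.pending = deque()
        self.queued = set()   # avoids queueing the same page twice
        self.done = {}        # page_id -> set of processed revision_ids

    def enqueue(self, page_id):
        if page_id not in self.queued:
            self.queued.add(page_id)
            self.pending.append(page_id)

    def reset(self, page_id):
        """Erase authorship state, e.g. after a page merge; rebuilt on next run."""
        self.done.pop(page_id, None)

    def run_once(self):
        """Take one page_id off the queue and process all its missing revisions."""
        if not self.pending:
            return []
        page_id = self.pending.popleft()
        self.queued.discard(page_id)
        processed = self.done.setdefault(page_id, set())
        missing = [r for r in self.fetch_revision_ids(page_id) if r not in processed]
        for revision_id in missing:
            self.process_revision(page_id, revision_id)
            processed.add(revision_id)
        return missing
```

The write path only needs to call `enqueue(page_id)`; the slow attribution work happens whenever the processing job calls `run_once()`.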

What we wrote can work also on labs, but:

   - We need a way to query the database for things like the list of all
   revision_ids of a given page.  We could use the API instead, but it's less
   efficient.
   - We need a way to read the text of revisions.  Again, the API can work,
   but direct access would be more efficient.
   - We need a place to store the authorship information.  This is
   several terabytes for enwiki.  Basically, we need access to some text
   store.  Is this available on labs?

We would welcome more information on how much of the above is feasible on
labs.

Luca

On Mon, Feb 25, 2013 at 7:27 PM, Matthew Flaschen wrote:

> On 02/25/2013 09:21 PM, Luca de Alfaro wrote:
> > I am writing this message as we hope this might be of interest, and as we
> > would be quite happy to find people willing to collaborate.  Is anybody
> > interested in developing a GUI for it and talk to us about what API we
> > should have for retrieving this authorship information?  Is there anybody
> > interested in helping to move the code to production-ready stage?
>
> Are you planning to run this live in production (i.e. 1-2 seconds on
> every save)?
>
> I think people would be reluctant to slow writes down further.  You
> could potentially do it deferred, or in the job queue, but I think it
> might make more sense on something like Wikimedia Labs
> (https://www.mediawiki.org/wiki/Wikimedia_Labs)
>
> Did you try doing it with no caching (similar to git blame, though I
> know it's a different algorithm)?  I'm wondering how much benefit you
> get from the cached info.
>
> Matt Flaschen

Re: [Wikitech-l] Missing project ideas for GSOC

2013-03-20 Thread Luca de Alfaro
Would there be interest in integrating the work on authorship computation?
This would not be an extension; it would be ... server-side development
that could fit well with a Summer of Code?

Luca

On Wed, Mar 20, 2013 at 4:43 PM, Quim Gil wrote:

> It's time to start defining what we want our Google Summer of Code to be
> all about. Let's look at the ideas we are proposing to potential students:
>
> https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects
>
> Many of the ideas listed there are too generic ("Write an extension"),
> improvements of existing features ("Improve Extension:CSS") or
> work-in-progress tasks ("Fix Parsoid bugs"). Many others are not directly
> related with development, and therefore not suitable either for GSOC.
>
> After this filtering, we seem to be left with:
>
> * Article evolution playback tool idea
> * An easy way to share wiki content on social media services
> * Write an extension to support XML Sitemaps without using command line
> * Extension:OEmbedProvider
> * Add support for x3d 3D files to MediaWiki
> * Allow smoother and easier Wikimedia Commons pictures discovery
> * Build an interwiki notifications framework and implement it for
> InstantCommons
> * Automatic category redirects
>
> (If you think your project should also be considered here please speak up!)
>
> Most of these projects seem to be extension (and PHP?) centric. Can we
> have more diversity? Maybe gadgets and templates are too simple for a GSOC
> project? What about the mobile front? Do we have skin development projects
> that could make it here? Anything in the DevOps area? Anything the
> MediaWiki core maintainers would like to see happening?
>
> It would be also nice to have more candidates benefiting specific
> Wikimedia projects. Beyond Wikipedia, we have several proposals related to
> Commons. Wikidata seems to be joining soon. What else? Could this be a
> chance to help Wiktionary, Wikibooks or any other project with specific
> needs craving for tech attention?
>
> Also to the many students that have already shown their interest: feel
> free to push your project ideas now!
>
> --
> Quim Gil
> Technical Contributor Coordinator @ Wikimedia Foundation
> http://www.mediawiki.org/wiki/User:Qgil

Re: [Wikitech-l] Authorship Tracking project for Google Summer of Code - looking for mentors

2013-05-01 Thread Luca de Alfaro
The project is inspired by WikiTrust, but unrelated.
We have re-designed from scratch an algorithm that provides authorship
tracking, and that is simpler, easier to maintain, and more efficient than
WikiTrust.

Part of the reason is that WikiTrust is difficult to maintain... in fact, I
would really love to find a good home for it -- some place where it can run
in a production environment, which UCSC is not.  So, the project we are
proposing would make at least one of the outputs of WikiTrust available in
a stable way, and with better quality.



On Wed, May 1, 2013 at 4:12 PM, Yury Katkov wrote:

> Hi!
>
> Is it related to WikiTrust [1]? They managed to track every word's
> authorship and even assign content-driven reputation to the authors. I'm a
> big fan of that project
>
> [1] http://www.wikitrust.net/
> -
> Yury Katkov, WikiVote
>
>
>
> On Thu, May 2, 2013 at 3:07 AM, Michael Shavlovsky wrote:
>
> > Hi,
> >
> > I am an applicant for Google Summer of Code and proposing to build an
> > Authorship Tracking MediaWiki extension.
> > I am looking for mentors and would appreciate feedback on the proposal
> > which can be found here
> > https://www.mediawiki.org/wiki/User:Mshavlovsky/Authorship_Tracking .
> >
> > Many many thanks,
> > Michael

Re: [Wikitech-l] Enable WikiTrust spanish support

2011-03-24 Thread Luca de Alfaro
Hi All,

yes, I think we could bring up support for WikiTrust on the Spanish
Wikipedia for this purpose.
The way we worked with Martin Walker for the English project is that he gave
us a list of page_ids, and we gave back a csv file with, for each page_id,
the recommended revision_ids, each with a quality indication and other
information (timestamps, other useful metadata...), and I think Martin
basically just followed the recommendation.
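
For illustration only, a consumer of such a file might pick the top-rated revision per page as sketched below. The column names `page_id`, `revision_id`, `quality`, and `timestamp`, and the sample rows, are assumptions made up for this example; the actual layout of the csv file may differ.

```python
import csv
import io


def best_revisions(csv_text):
    """Return {page_id: revision_id} choosing the highest-quality row per page."""
    best = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        page_id = int(row["page_id"])
        # Compare by quality first, then by revision_id to break ties.
        candidate = (float(row["quality"]), int(row["revision_id"]))
        if page_id not in best or candidate > best[page_id]:
            best[page_id] = candidate
    return {page_id: revision_id for page_id, (_, revision_id) in best.items()}


# Hypothetical sample data, not real WikiTrust output.
sample = (
    "page_id,revision_id,quality,timestamp\n"
    "12,1001,0.62,2011-01-05T10:00:00Z\n"
    "12,1007,0.91,2011-02-11T08:30:00Z\n"
    "34,2002,0.40,2011-03-01T12:00:00Z\n"
)
selection = best_revisions(sample)
```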

How far away are you from having a list of page_ids?
If we could support this on our existing server, it should not be too much
work for us to set it up.
Let us know.

I apologize for the delay in answering!

Luca

On Thu, Mar 24, 2011 at 5:29 AM, Wilfredor wrote:

> Yours sincerely,
>
> I have long been trying to start a Wikipedia 1.0 project in Spanish
> (http://es.wikipedia.org/wiki/Wikipedia:Wikipedia_en_CD), a project
> similar to the English version.
>
> The problem is that I have been unable to contact the WikiTrust team
> (http://www.wikitrust.net/authors). We need support for Spanish in the
> system, which does not exist yet.
>
> I apologize in advance if this is not the right place.
>
> Thank you very much.
>
> --
> User:Wilfredor